Commonsense-Focused Dialogues for Response Generation: An Empirical Study

Smooth and effective communication requires the ability to perform latent or explicit commonsense inference. Prior commonsense reasoning benchmarks (such as SocialIQA and CommonsenseQA) mainly focus on the discriminative task of choosing the right answer from a set of candidates, and do not involve interactive language generation as in dialogue. Moreover, existing dialogue datasets do not explicitly focus on exhibiting commonsense as a facet. In this paper, we present an empirical study of commonsense in dialogue response generation. We first auto-extract commonsensical dialogues from existing dialogue datasets by leveraging ConceptNet, a commonsense knowledge graph. Furthermore, building on social contexts/situations in SocialIQA, we collect a new dialogue dataset with 25K dialogues aimed at exhibiting social commonsense in an interactive setting. We evaluate response generation models trained using these datasets and find that models trained on both extracted and our collected data produce responses that consistently exhibit more commonsense than baselines. Finally we propose an approach for automatic evaluation of commonsense that relies on features derived from ConceptNet and pre-trained language and dialog models, and show reasonable correlation with human evaluation of responses’ commonsense quality.


Introduction
Open-domain dialogue response generation (RG) models aim to provide human-like natural language responses given dialogue histories (Chen et al., 2017).
* Work done while Pei Zhou was an intern at Amazon Alexa AI.
1 https://github.com/alexa/commonsense-dialogues. Note that this released data is slightly different from the initial version used in the paper. The new data was manually checked by in-house crowd workers.
To improve the quality of generated responses, many studies have developed knowledge-grounded RG (Ghazvininejad et al., 2018; Gopalakrishnan et al., 2019), personalized dialogue agents (Zhang et al., 2018), empathetic response generation (Rashkin et al., 2019), etc. For all of these directions, large-scale dialogue data geared towards the specific goal is crucial, since most current state-of-the-art neural RG models require training on large amounts of appropriate data.
Therefore, several datasets have been collected to support such research efforts, such as knowledge-grounded dialogues (Ghazvininejad et al., 2018; Gopalakrishnan et al., 2019), PersonaChat (Zhang et al., 2018), and EmpatheticDialogues (Rashkin et al., 2019).
Producing natural and logically coherent responses given dialogue contexts involves making commonsense inferences during the communication. For example, if someone says "I'm going to perform in front of a thousand people tomorrow..." the listener is likely to conclude that the speaker is probably feeling nervous and respond by comforting them: "Relax, you'll do great!" In contrast to other efforts to make RG models more empathetic or knowledgeable, there is a lack of commonsense-focused dialogue data for both training neural models and evaluation. An ideal dataset for studying commonsense in RG needs to simulate how humans have multi-turn conversations as much as possible. Existing commonsense-focused work in RG uses post-response pairs extracted from Reddit (Zhou et al., 2018), which are single-turn and only rough approximations of real-life conversations.
Aiming to bridge the gap in commonsense for dialogue response generation, we collect a large-scale multi-turn open-domain dialogue dataset that is focused on commonsense knowledge. We first consider extracting commonsense-focused dialogues from three existing dialogue datasets by identifying responses that contain commonsense inferences using ConceptNet (Liu and Singh, 2004). This filtering results in 21k dialogues. Then we collect 25k new dialogues focusing on social commonsense inferences, where prompts are context sentences describing an event in the SocialIQA data (Sap et al., 2019b).
To study commonsense in RG, we train large generative language models on our datasets and compare them with models trained on existing datasets. We find through sampled human evaluation that our dataset helps to generate more commonsensical responses (average score of 6.9 out of 10 compared to 4.8 using other data), and that automatically generated responses still show a large gap compared to human performance (9.2 out of 10). To help lower the cost and increase the efficiency of evaluating commonsense in RG, we further propose an automatic metric using combined neural and symbolic features derived from ConceptNet, and show that this metric has reasonable correlation with human annotations and that symbolic features contribute positively to system performance.
Our contributions are as follows: (1) We create the first large-scale open-domain dialogue dataset focusing on social commonsense inferences. This includes a new collection of 25k dialogues based on SocialIQA event prompts, and ConceptNet-filtered data from some existing data sets. We are also releasing the ConceptNet-filtered portion of our data collection, Commonsense-Dialogues, which contains about 11K dialogs. (2) We benchmark our dataset and show that training on it helps models produce more commonsensical responses. (3) We propose the first automatic metric for evaluating commonsense plausibility in response generation that reaches statistically significant correlation with human annotations.

Commonsense-Focused Dialogue Response Generation
We study commonsense-focused response generation for dialogues. Commonsense can be defined as "the basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people" (Sap et al., 2020). Dialogue response generation is the task of generating a response turn r in a conversational setting given previous history turns h. Combining the two, we want to examine models' ability to produce responses that make sense or are plausible in terms of commonsense.

Motivations
Lack of Commonsense-Focused Analysis on Existing Dialogue Datasets. Numerous dialogue datasets have been collected for training RG models and other dialogue-related tasks. As mentioned before, many different aspects of RG have been explored, such as knowledge grounding (Ghazvininejad et al., 2018; Gopalakrishnan et al., 2019) and empathy (Rashkin et al., 2019). However, to the best of our knowledge, there is no study or large-scale multi-turn data for analyzing whether model-generated responses show the ability to communicate with commonsense knowledge or reasoning.
Lack of Real-Life Interactive Setting for Commonsense Reasoning Benchmarks. Current commonsense reasoning (CSR) benchmarks mostly target models' ability to choose the right answer from several candidates given a question. We argue that this is a highly artificial scenario, as models do not get options to choose from in real life and often need to generate utterances. Recent work such as CommonGen has started to explore generative settings to examine commonsense in natural language processing (NLP) models. This line of work, however, is still far from real use cases, as it does not consider a real-life interaction task setup such as conversations. Thus we argue that existing commonsense benchmarks in NLP are not sufficient to train a language agent that produces smooth interpersonal communication, nor to evaluate whether models have such capabilities.

Commonsense Focused Dialogue Collection
To collect more commonsense-focused dialogues for response generation model training and evaluation, our effort is along two directions: filtering existing data to collect dialogues with responses that exhibit commonsense (Section 3.1), and curating new data using prompts from the commonsense reasoning multiple-choice benchmark SocialIQA (Section 3.2).

Filtering Based on Existing Dialogue Datasets
We propose a simple process for filtering commonsense in dialogues and present our analysis of three dialogue datasets with different focuses: DailyDialog (Li et al., 2017), EmpatheticDialogues (Rashkin et al., 2019), and MuTual (Cui et al., 2020). The general idea is to refer to a commonsense knowledge graph (CSKG) such as ConceptNet (Liu and Singh, 2004) to identify potential commonsense triples $(e_1, r, e_2)$ expressing a commonsense assertion between turns in a dialogue. The following describes the detailed process.
Identify Candidate Concepts The first step is to identify potential candidates for concept entities in the commonsense triples. For a turn in a dialogue, we use a part-of-speech (POS) tagger to find the nouns, verbs, and adjectives that are not stopwords and then construct a set of potential concepts by including the lemmatized version of these words. We use the POS tagger, lemmatizer, and stopword list from the Natural Language Toolkit (NLTK) package (Bird et al., 2009). This step results in a set of concept words for each turn of a dialogue. For example, consider an exchange between two participants in a conversation: "Hi, I want to find a doctor" and "What kind of doctor are you looking for? A general doctor or a specialist?" The concept sets for the two turns are "want, find, doctor" and "look, general, doctor, specialist", respectively.
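The concept extraction step can be illustrated with a minimal Python sketch. To stay self-contained it does not call NLTK: the turns are assumed to be already POS-tagged with Penn Treebank tags, and `STOPWORDS` and `LEMMAS` are small hypothetical stand-ins for the NLTK stopword list and lemmatizer. The toy stopword list includes "kind" only so that the output matches the example above; the actual resources may behave differently.

```python
# Toy stand-ins for the NLTK resources named in the text (assumptions, not NLTK data).
STOPWORDS = {"i", "a", "to", "of", "are", "you", "for", "what", "or", "hi", "kind"}
LEMMAS = {"looking": "look"}  # hypothetical lemma table; a real lemmatizer replaces this
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")  # Penn Treebank nouns, verbs, adjectives


def extract_concepts(tagged_turn):
    """Return the set of lemmatized content words for one POS-tagged turn."""
    concepts = set()
    for word, tag in tagged_turn:
        token = word.lower()
        if not tag.startswith(CONTENT_TAG_PREFIXES):
            continue  # keep only nouns, verbs, and adjectives
        if token in STOPWORDS:
            continue  # drop stopwords
        concepts.add(LEMMAS.get(token, token))
    return concepts


# The two turns from the example, tagged by hand for illustration.
turn_1 = [("Hi", "UH"), (",", ","), ("I", "PRP"), ("want", "VBP"), ("to", "TO"),
          ("find", "VB"), ("a", "DT"), ("doctor", "NN")]
turn_2 = [("What", "WP"), ("kind", "NN"), ("of", "IN"), ("doctor", "NN"),
          ("are", "VBP"), ("you", "PRP"), ("looking", "VBG"), ("for", "IN"),
          ("?", "."), ("A", "DT"), ("general", "JJ"), ("doctor", "NN"),
          ("or", "CC"), ("a", "DT"), ("specialist", "NN"), ("?", ".")]

concepts_1 = extract_concepts(turn_1)
concepts_2 = extract_concepts(turn_2)
```

Returning a set rather than a list reflects that only membership matters in the later matching steps.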

Query ConceptNet for Neighboring Entities
With the set of concepts extracted for every dialogue turn, we then identify a list of candidate triples $(e_1, r, e_2)$ expressing commonsense assertions about each concept, so that we can later check whether some of those assertions indeed appear in the dialogue. We rely on the widely-used ConceptNet (Liu and Singh, 2004) as the knowledge resource, which consists of commonsense knowledge about various concepts. Specifically, we use the ConceptNet version containing single-word concepts pre-processed by Zhou et al. (2018). For each concept we identified in a turn, we store all triples in ConceptNet that contain this concept, either as subject or object. Using the above example, triples about "doctor" include "doctor LocateAt hospital", "patient RelatedTo doctor", and "specialist TypeOf doctor".
Search Entities in the Next Turn After getting a list of commonsense triples $(e_1, r, e_2)$ containing concepts in a particular turn using ConceptNet, we next examine whether the other entity in any of these triples appears in the concept set of the next turn. In the example dialogue exchange above, where "doctor" is a concept appearing in a turn, for the triple "specialist TypeOf doctor", we check whether "specialist" is in the concept set of the next turn. Since we find such a match, we record this triple as a commonsense assertion that might be implied in the response.
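The query and search steps can be sketched as follows. This is an illustrative Python sketch rather than the exact implementation: `TRIPLES` is a toy in-memory knowledge graph holding only the three example triples from the text, and the function names (`build_index`, `match_adjacent_turns`, `keep_dialogue`) are hypothetical.

```python
from collections import defaultdict

# Toy knowledge graph in (head, relation, tail) form, using the example triples from the text.
TRIPLES = [
    ("doctor", "LocateAt", "hospital"),
    ("patient", "RelatedTo", "doctor"),
    ("specialist", "TypeOf", "doctor"),
]


def build_index(triples):
    """Map each concept to every triple that contains it as subject or object."""
    index = defaultdict(list)
    for head, rel, tail in triples:
        index[head].append((head, rel, tail))
        if tail != head:
            index[tail].append((head, rel, tail))
    return index


def match_adjacent_turns(turn_concepts, next_turn_concepts, index):
    """Return triples linking a concept in one turn to a concept in the next turn."""
    matches = []
    for concept in sorted(turn_concepts):
        for head, rel, tail in index.get(concept, []):
            other = tail if head == concept else head
            if other != concept and other in next_turn_concepts:
                matches.append((head, rel, tail))
    return matches


def keep_dialogue(turn_concept_sets, index):
    """Keep a dialogue if any pair of adjacent turns shares at least one triple."""
    pairs = zip(turn_concept_sets, turn_concept_sets[1:])
    return any(match_adjacent_turns(a, b, index) for a, b in pairs)


index = build_index(TRIPLES)
turn_1 = {"want", "find", "doctor"}
turn_2 = {"look", "general", "doctor", "specialist"}
matched = match_adjacent_turns(turn_1, turn_2, index)
```

Note that the match is purely on surface form and one hop, which is exactly the source of the limitations discussed in the text.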
Filtering Results We filter dialogues using the above-mentioned approach: if we can find a matching triple between two adjacent turns, we keep the dialogue, as it might contain commonsense assertions identified from ConceptNet. We consider three dialogue datasets in this study:
• DailyDialog (DD) (Li et al., 2017). It includes general-domain day-to-day dialogues crawled from various English learning websites.
• EmpatheticDialogues (ED) (Rashkin et al., 2019). It focuses on empathy in dialogues.
• MuTual (Cui et al., 2020). It is a reasoning-focused response selection dataset based on English listening comprehension exams for Chinese students.
We choose these three datasets to examine three different types of focus in dialogue datasets: general-domain, empathy, and general reasoning (but not specifically commonsense). After filtering, we find that in the training sets, around 7k out of the 11k dialogues (63%) from DailyDialog contain at least one matched triple between their turns, as do 9.5k out of the 18k dialogues (53%) from EmpatheticDialogues and 5k out of the 7k dialogues (73%) from MuTual. For the validation and test sets, the proportion of such dialogues is similar to that in the training sets for all three data sets.
Note that there are some limitations in our ConceptNet-based data selection approach. First, we match concept entities based on surface form only, rather than semantic meaning or word senses in context. Second, we only use single-word concepts, not phrases. Third, we only consider one-hop concept relations identified in ConceptNet. The first limitation may affect the precision of the selected dialogues, and the other two affect the recall. Without human-annotated commonsense reasoning for dialog turns, we cannot compute the exact performance of our filtering method. We plan to conduct some human annotation in future work. Among the three data sets used in this study, the higher percentage of dialogues selected from MuTual may indicate that this data focuses more on reasoning and is thus more likely to contain commonsense relations.

New Data Collection Using SocialIQA Prompts
To facilitate commonsense-guided response generation training, we collect more dialogues with a focus on getting responses that require commonsense. Specifically, we make use of an existing commonsense multiple-choice benchmark, SocialIQA (Sap et al., 2019b), to crowdsource dialogues. This section provides background on SocialIQA, the crowdsourcing process, and the resulting dialogues.
Background and motivation We collect dialogues by prompting crowdsourcing workers on Amazon Mechanical Turk (MTurk) with context sentences from SocialIQA that describe an event in everyday social scenarios. SocialIQA (Sap et al., 2019b) is a large-scale commonsense reasoning benchmark about social situations. It contains around 38k multiple-choice questions, each consisting of a context sentence, a question, and three answer choices. Contexts were generated by rewriting events from ATOMIC (Sap et al., 2019a), a large knowledge graph (KG) that contains inferential knowledge about the causes and effects of 24k short events. An example event in ATOMIC is "PersonX spills all over the floor", which crowd workers were asked to turn into a sentence by adding names, fixing potential grammar errors, and filling in placeholders, resulting in a context like "Alex spilled food all over the floor." We choose to use SocialIQA contexts for three reasons: (1) they are specific instantiations of the event phrases found in the knowledge graph ATOMIC, which guarantees that there is at least one potential commonsense inference that can be made from the event; (2) ATOMIC covers a wide range of commonsense motivations and reactions and thus the contexts also embed diverse commonsense; (3) the rewriting process from SocialIQA ensures that the context sentences are well-formed and similar to natural sentences, which we expect makes it easier for crowd workers to come up with a dialogue.
Prompt selection We inspected around 200 contexts by trying to write a dialogue for each, and found that the contexts we had the most difficulty with are those that are too short or do not contain an interesting event to start a conversation. For example, contexts such as "Robin stopped eating the food to save room for dessert" might not be an interesting event to talk about in a dialogue. To select appropriate contexts as prompts for dialog writing, we apply a simple heuristic criterion: the context has to be either longer than 15 words or contain a punctuation mark such as a comma or a period in the middle. The intuition is that longer contexts are easier to write a dialogue with because they contain more information, and a punctuation mark often indicates a development in the narrative of the event (e.g., "Tracy performed her function. Their employer gave them a raise"). This makes the event more complicated and thus avoids overly trivial events. We also filter out context sentences that do not contain any person names. As a result of this preprocessing, we kept 12.8k out of 33k contexts in the training set and 754 out of 2k contexts in the development set, adding up to 13.5k contexts from SocialIQA.
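The selection heuristic can be expressed as a small predicate. The following Python sketch is an illustration under stated assumptions: the text does not say how person names were detected, so `PERSON_NAMES` is a hypothetical lookup set, and "punctuation in the middle" is interpreted as a comma or period that is not sentence-final.

```python
import re

# Hypothetical name list; the text does not specify how person names were detected.
PERSON_NAMES = {"Tracy", "Robin", "Addison", "Alex"}


def has_internal_punctuation(context):
    """True if a comma or period occurs before the sentence-final punctuation."""
    body = context.strip().rstrip(".!?")  # ignore punctuation at the very end
    return bool(re.search(r"[,.]", body))


def is_good_prompt(context, min_words=15):
    """Apply the length-or-punctuation rule plus the person-name requirement."""
    words = context.split()
    long_enough = len(words) > min_words
    mentions_person = any(w.strip(".,!?'\"") in PERSON_NAMES for w in words)
    return mentions_person and (long_enough or has_internal_punctuation(context))


short_context = "Robin stopped eating the food to save room for dessert"
two_sentence_context = "Tracy performed her function. Their employer gave them a raise."
comma_context = ("Addison wanted to go on a trip to Mexico, and messaged all of "
                 "his friends to set up a schedule.")
```

Under these assumptions, the short "Robin" context from the text is rejected, while the "Tracy" and "Addison" contexts are kept because of their internal punctuation.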
Dialogue Collection Using selected contexts from SocialIQA, we ran a task on MTurk asking each worker to write a dialogue with 4 to 6 turns between two friends about the event described in the context. Note that this is a 'self-talk' dialog collection. Specifically, since there is a name appearing in the context, we ask a worker to write a dialogue by first imagining that they are the person mentioned in the context and are talking with their friend about the event described. For example, for the context above ("Tracy performed her function. Their employer gave them a raise"), we ask a worker to imagine themselves to be "Tracy" and that they are talking to a friend (also played by themselves) about getting a raise.
We pose three requirements for turkers in order to work on our task: be located in the US, UK, or Canada; have more than 1000 successful HITs; and have a HIT acceptance rate above 95%. We pay MTurk workers $0.5 for each instance, roughly translating to 10 dollars per hour, well above the US minimum wage.

Table 1. Prompts and dialogue examples.

Prompt: Tracy performed her function. Their employer gave them a raise.
Dialogue example 1:
Tracy: I got a raise today. Totally unexpected. My boss told me I was doing a great job.
Friend: It feels good to be rewarded for hard work.
Tracy: I've been trying my best at this job. I've been putting in long hours to make sure I get everything done.
Friend: Sounds like your boss recognized that.
Tracy: It's great when people can work well together.
Dialogue example 2:
Tracy: Get dressed. We're going out to celebrate my raise.
Friend: Awesome. What did your boss say when you got it?
Tracy: She said I did my job very well and deserved it.
Friend: You should be so proud. You've earned it.

Prompt: Addison wanted to go on a trip to Mexico, and messaged all of his friends to set up a schedule.
Dialogue example 1:
Addison: Hey guys! I'm planning a Mexico vacation for everyone! Let's work out a schedule so we can all do somethings we want to do together.
Friend: I'm down! We should get in some scuba diving. I've been wanted to get some good underwater photos for my gallery.
Addison: That sounds fun! I've never scuba dived before. Do you have to have any training?
Friend: They give you a little course on how to use the equipment. You can opt out and just do the snorkeling if it's too intimidating.
Dialogue example 2:
Addison: I think we'll go to Mexico next.
Friend: That sounds exciting. Did you find a time that works for everyone.
Addison: No! But I'm going to message them right now to find out!
Friend: Yeah, You had better figure out a time as soon as possible. Scheduling is super hard with more than 3 people.
Addison: Yep. But we'll get it done! My friends are the best at this!

To account for multiple plausible dialogues expanded from the context event, we assign each context to five different MTurk workers. We randomly sample 5k context sentences out of 13.5k filtered ones and collect five dialogues for each context, resulting in 25k dialogues. Examples of our collected dialogues are shown in Table 1.
For our collected data, we follow the same filtering steps used for the existing data (Section 3.1). This ConceptNet filtering identifies about 11K dialogues from the entire collection (see footnote 2). Although we expected that the SocialIQA contexts, being derived from ATOMIC, would trigger more commonsensical dialogues, we find this is not the case: the percentage of dialogues containing ConceptNet triples is even lower than what we observed for the existing data sets. This may be due to the limitations of our filtering method described earlier: matching to ConceptNet is based on surface textual form and concepts are at the word level, which omits deeper and more contextual commonsense relationships.
2 Our released data is based on these ConceptNet-filtered conversations. We recruited in-house crowd workers to manually check the dialogs for profanity, speaker name, and other issues in the data. Note that the experiments conducted in this paper used the initial collection, not this released version.

Experiment Setup and Evaluation Methods
The focus of this study is to examine how commonsense plays a role in dialogue response generation. In previous sections, we proposed a simple filtering method to obtain commonsense-focused dialogues from three existing datasets and crowdsourced more dialogues based on the SocialIQA commonsense reasoning benchmark. Here we aim to evaluate response generation models' ability to produce responses that follow commonsense, and whether training on commonsense-focused dialogue data helps boost model performance. In addition to using automatic referenced metrics and human evaluation, we also propose a new automatic unreferenced metric for evaluating the commonsense quality of responses.

Experiment Settings
For response generation models, we take one of the state-of-the-art pre-trained language models, GPT2 (Radford et al., 2019), and further train it on our training data sets. Specifically, the model is trained in a multitask fashion that minimizes the LM loss as well as the multiple choice loss following Wolf et al. (2019), and generates responses for a given dialog history. We consider the following three types of training data setups.
• The first category uses existing dialogue datasets as is, with a separate model trained on each dataset. MuTual is not included since it is designed for response selection.
• As described in Section 3.1, we use ConceptNet to search for potential triples in response turns and filter three dialogue datasets: DD, ED, and MuTual. We combine the filtered dialogues from these three datasets to form our training data, named 'filter existing' (FE, around 21K dialogues in total).
• The third category includes our collected dialogues using SocialIQA contexts. This is used along with the FE data above: FE and all of the 25k collected dialogues (FE+new crowdsourced), and FE plus the 11K filtered dialogues of our collected data (FE+filtered crowdsourced).
To evaluate models' response generation capabilities, we sample 10% of the FE+new data, resulting in 4.6k testing dialogues with no overlap with the training set of any of the settings above. We use GPT2 trained on different versions of dialogue data (6 trained GPT2 models in total) to generate a randomly sampled response for each turn of our test set dialogues.
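A held-out test set of this kind can be drawn with a simple random split. The Python sketch below is only an illustration: the function name `split_train_test`, the fixed seed, and the representation of dialogues as list items are assumptions, and the text does not describe how the sampling was actually implemented.

```python
import random


def split_train_test(dialogues, test_fraction=0.1, seed=0):
    """Randomly hold out a fraction of dialogues; the two parts never overlap."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    indices = list(range(len(dialogues)))
    rng.shuffle(indices)
    n_test = round(len(dialogues) * test_fraction)
    test_indices = set(indices[:n_test])
    train = [d for i, d in enumerate(dialogues) if i not in test_indices]
    test = [d for i, d in enumerate(dialogues) if i in test_indices]
    return train, test


# Integers stand in for the roughly 46k dialogues of the combined FE and new data.
all_dialogues = list(range(46000))
train_dialogues, test_dialogues = split_train_test(all_dialogues)
```

To guarantee no overlap with any training configuration, the held-out dialogues would also need to be removed from every other training setup that draws on the same sources.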

Evaluation Metrics
We perform automatic evaluation on the test set and human evaluation on sampled dialogs.

Automatic Evaluation
We consider several widely-used automatic metrics for evaluating response generation: perplexity of the reference responses in the data, Meteor score (Banerjee and Lavie, 2005), ROUGE score (Lin, 2004), and BERTScore (Zhang et al., 2019). Note that these metrics (except perplexity) provide general evaluation of the generated responses, but do not specifically focus on commonsense plausibility.
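For reference, perplexity is the exponential of the average negative log-likelihood that the model assigns to the tokens of a reference response. The Python sketch below assumes the per-token natural-log probabilities have already been obtained from the language model; how perplexity is aggregated over the whole test set is not specified in the text.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one reference response from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_quarter = perplexity([math.log(0.25)] * 4)
```

Lower perplexity means the model finds the reference response less surprising.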

Human Evaluation
Since there is no existing evaluation method that reliably examines whether a response follows commonsense and correlates with human judgements, we ask humans to score system-generated responses as well as the reference response given a dialogue history. We sample 300 history-response pairs from dialogues in our test set to perform human evaluation. All the model-generated responses from the 6 trained models above and the original (human) response, around 2100 responses in total, are scored by MTurk workers in terms of commonsense plausibility on a scale of 1 to 10. We also instructed workers that criteria such as grammatical correctness and fluency should not be given much weight and that they should focus on evaluating the commonsense aspect of the response. Three annotators evaluated each response. We calculate the average and variance of the human scores to measure the performance of different response sources.
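The aggregation of annotator ratings can be sketched as follows. This Python sketch makes two assumptions that the text does not settle: the three ratings of each response are averaged first, and the variance is the population variance of those per-response averages.

```python
from statistics import mean, pvariance


def summarize_scores(ratings_per_response):
    """Average each response's annotator ratings, then report mean and variance."""
    per_response = [mean(ratings) for ratings in ratings_per_response]
    return mean(per_response), pvariance(per_response)


# Two hypothetical responses, each rated by three annotators on a 1-10 scale.
example_ratings = [[8, 9, 10], [4, 5, 6]]
average_score, score_variance = summarize_scores(example_ratings)
```

A system with a high average but also a high variance produces commonsensical responses only inconsistently, which is why both numbers are reported.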

Proposed Automatic Metric for Commonsense
Human evaluation is expensive to obtain, especially when the dataset is large. In addition, it is subjective and hard to reproduce. Aiming to provide a reliable and scalable automatic metric focusing on commonsense in response generation, we propose an unreferenced automatic metric, which is a regression model trained on the human annotation scores for different responses. The metric is reference-free, meaning that it does not require a human ground-truth response when scoring a model-generated response, unlike referenced metrics such as BLEU, ROUGE, and Meteor.
Regressor model We use a simple multi-layer perceptron (MLP) as our regressor and consider both neural and symbolic features to train the MLP model. For symbolic features, we consider the number of one-hop and two-hop triples that can be found between the dialogue history and the response turn from ConceptNet. The triple identifying process is the same as our filtering process described earlier (Section 3.1). That is, we first identify a set of concepts in the response turn, query ConceptNet for potential triples, and match those with the other concepts appearing in the dialogue history. Two-hop triples are searched in a similar manner, with the only difference being that the number of potential triples will be much larger. We also include the length of the response as an additional feature. As for neural features, we use the scores from a dialogue-focused language model, DialoGPT, on both the response itself and the dialogue history concatenated with the response. The score from DialoGPT can be considered as the plausibility of the sentence. We train this MLP model using the human evaluation scores for different responses.
Table 2 shows results according to automatic metrics on our 4.6K testing dialogues. We find that perplexity scores for the GPT2 model trained on filtered existing dialogue data (FE), with or without the newly collected data (FE+Crowdsourced), are much lower than those of models trained only on existing datasets as is. There are several reasons for this. One is that since the testing dialogues are from the filtered version, training on filtered data better matches the evaluation scenario. In addition, the test set is a sample of multiple data sets, and thus training on just one data set does not perform well. Finally, the combined data (the last three rows in the table) is larger in size (see training size in Table 3).
However, note that the gain from increasing the training data size is smaller than the difference between the filtered data settings and the single data sets. Meteor and ROUGE scores for all the trained models are quite low and show smaller differences, probably indicating the limitation of these metrics for dialog response evaluation. BERTScore shows a similar pattern to perplexity in terms of model quality.
Table 3 shows the human evaluation scores on 300 responses for models trained with different types of data. The most obvious and perhaps expected finding is that GPT2, regardless of the type of data it is trained on, is still far behind human performance (6.86 with high variance versus 9.3 with low variance). By analyzing different variables that cause performance differences, we find the following patterns, some of which are similar to those observed with automatic metrics.

Human Evaluation Results
(1) Using the Filtered Existing dialogue data (FE) helps improve the average commonsense score (more than 1 point of improvement compared to using individual data sets), but variance remains high; (2) Including our collected dialogues (FE+Crowdsourced) further increases the average, and also decreases the variance in response quality in terms of commonsense plausibility; (3) Regarding our collected data, using the filtered subset yields slightly better performance than using the entire data collection. This suggests that even though our data is collected using SocialIQA events, some dialogues may not be commonsense-rich, which is also reflected by the percentage of dialogues that contain ConceptNet triples as discussed earlier. In addition, it shows that although increasing training data size benefits model performance overall, the quality of data plays a more important role. We plan to perform more sophisticated data selection and commonsense annotation for our data set in the future.
We include examples of responses from humans and from models trained on these different types of data, as well as annotation scores, in Appendix A Table 5. The examples show some different characteristics of the responses, for example, empathy in the responses from the ED model, and richer information (though inappropriate since it is off topic) in the responses from the TC model.

Proposed Commonsense Automatic Evaluation Results
We now examine the correlation of our proposed automatic metric (MLP regressor) with human scores on the testing portion of our annotations. We cross-validate on the collected dialogues with 0.8/0.1/0.1 proportions. For comparison, we consider three baselines: our MLP with only symbolic features, our MLP with only neural features, and FED (Mehri and Eskenazi, 2020a), which uses DialoGPT to score how likely the next turn after the response expresses confusion. FED requires no training or human references, and has been shown to correlate with human judgements on different criteria (commonsense not included). Table 4 shows the Spearman's correlation between the system-computed scores and human annotation scores using all the annotated data in a cross-validation setup. We can see that our simple MLP-based regressor reaches the highest Spearman's correlation with human scores, outperforming other baselines significantly. However, such a correlation result still suggests a large gap to a reliable scorer targeting commonsense evaluation for dialogue response generation. We also notice that FED performs poorly in terms of commonsense evaluation. Furthermore, there is a large correlation drop when considering either symbolic or neural features alone in our model, indicating that they might each capture a different aspect of commonsense evaluation.
The majority of recent commonsense reasoning benchmarks (Zellers et al., 2018; Talmor et al., 2019; Bisk et al., 2020; Sap et al., 2019b) test a model's ability to choose the correct option given a context and a question; pre-trained language models have reached high performance on these benchmarks after fine-tuning. There have been many benchmarks that focus on reasoning abilities in multiple tasks such as reading comprehension (Huang et al., 2019), dialogue systems (Cui et al., 2020), and natural language inference (Williams et al., 2018), which involve inferences on language.
Recent work also aims to probe models in these tasks to see if reasoning is actually achieved.
In this study we tackle the response generation problem in dialogues, with a focus on collecting commonsense rich dialog data and evaluating commonsense quality of model responses.

Open Domain Dialogue Response Generation
Recently, open-domain dialog systems have been modeled using end-to-end approaches, more specifically encoder-decoder architectures (Sordoni et al., 2015; Vinyals and Le, 2015). Recent work has focused on fine-tuning large pre-trained transformer models (Radford et al., 2019) on dialog data. Many dialog datasets have been collected with different focuses such as incorporating knowledge (Gopalakrishnan et al., 2019), empathy (Rashkin et al., 2019), task completion (Budzianowski et al., 2018), consistency (Nie et al., 2020), personality (Zhang et al., 2018), and reasoning (Cui et al., 2020) within dialog systems. There has also been work on combining a variety of datasets to exhibit multiple attributes (Roller et al., 2020).

Dialog Response Evaluation
Due to the diverse responses that a dialog system can output, referenced automatic metrics (such as BLEU, ROUGE, and perplexity) do not correlate well with human judgement of these systems (Deriu et al., 2020; Liu et al., 2016). As a result, human evaluation has become the de facto standard for evaluating dialog systems. However, human evaluation is costly. Recently, model-based metrics have been proposed that show good correlation with human annotations (Zhang et al., 2019; Sellam et al., 2020; Mehri and Eskenazi, 2020b,a; Tao et al., 2018). Most metrics focus on evaluating the coherence or appropriateness of a response with respect to its dialog context. Mehri and Eskenazi (2020a) identified 18 different dialog qualities such as interesting and topic depth. However, none of these metrics evaluate the commonsense of a response, which is the focus of this work.

Conclusion
We present our empirical study on commonsense in dialogue response generation. To obtain data for commonsense-focused analysis in open-domain response generation, we use two strategies: filtering existing dialogue data using the commonsense knowledge graph ConceptNet, and collecting new dialogues using prompts from a multiple-choice commonsense benchmark. Our data has a few limitations; for example, our filtering process focuses on word-level matching to ConceptNet, which might omit more complex commonsense relations embedded in dialogues. We leave deeper analysis of how implicit commonsense is represented in dialogues, and of how to elicit multi-hop granular reasoning steps during communication, to future work. Our experimental results show that our newly collected data helps boost response generation model performance based on human evaluation of commonsense. To close the gap in automatic evaluation metrics for response generation, we propose a simple MLP regressor using both neural and symbolic features, and show promising correlation with human judgements.
We are releasing the ConceptNet filtered portion of our data collection, with further manual examination, Commonsense-Dialogues, which consists of about 11K dialogs. We hope our work and this new data will help with future attempts to make models produce responses with more commonsense, which is a challenging but crucial task to tackle in dialog systems.

Ethics and Broader Impact
Our work uses ConceptNet to filter for commonsense-focused dialogues, but Mehrabi et al. (2021) have found representational harms in commonsense resources. We acknowledge that the generated responses from the models we use might contain biases. All of the dialogue datasets and models are in English, which benefits English speakers more. We used Amazon Mechanical Turk for data collection. We pay turkers around $14 per hour, well above the highest state minimum wage, and engage in constructive discussions if they have concerns about the process. We also give each annotation instance enough time so that we do not pressure annotators.