NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions

Existing conversational systems are mostly agent-centric: they assume the user utterances closely follow the system ontology (for NLU or dialogue state tracking). However, in real-world scenarios it is highly desirable that users can speak freely in their own way, and it is extremely hard, if not impossible, for users to adapt to an unknown system ontology. In this work, we attempt to build a user-centric dialogue system. As there is no clean mapping from a user's free-form utterance to an ontology, we first model user preferences as estimated distributions over the system ontology and map the users' utterances to such distributions. Learning such a mapping poses new challenges for reasoning over existing knowledge, ranging from factoid knowledge and commonsense knowledge to the users' own situations. To this end, we build a new dataset named NUANCED that focuses on such realistic settings for conversational recommendation. Collected via dialogue simulation and paraphrasing, NUANCED contains 5.1k dialogues and 26k turns of high-quality user responses. We conduct experiments showing both the usefulness and the challenges of our problem setting. We believe NUANCED can serve as a valuable resource to push existing research from agent-centric systems to user-centric systems. The code and data are publicly available at \url{https://github.com/facebookresearch/nuanced}.


Introduction
Conversational artificial intelligence is one of the long-standing research problems in natural language processing, spanning task-oriented dialogue (Wen et al., 2017; Budzianowski et al., 2018; Hosseini-Asl et al., 2020), conversational recommendation (Sun and Zhang, 2018), and chit-chat (Adiwardana et al., 2020; Roller et al., 2020). However, most existing systems are agent-centric. Such systems require the users to unnaturally adapt to, and even face a learning curve on, the system ontology, which is largely unknown to the users (consider the sample instructions for most smart speakers). Figure 1 shows a dialogue snippet commonly found in traditional datasets, with a system ontology of slots and values such as category (Japanese, Korean, Chinese, New American, etc.), alcohol (full bar, beer and wine, don't serve), attire (casual, dressy, formal), and wifi (free, paid, no), and a user reply like "full bar please.": the user is expected to closely follow the system ontology with the exact ontology terms, or at most with minor variations like synonyms. (* Work done as a research intern at Facebook.)
In real-world use cases, such a formulation may easily result in information loss, or break the conversation if the user says anything outside the system ontology. In this work, we argue that a smart agent can ideally be more user-centric, allowing users to speak freely without restrictions. The system is then expected to uncover the connection between the free-style user utterance and one or more slots and values in the system ontology.
To build a user-centric dialogue system, we propose to model the mapping from free-form user utterances to the system ontology as probability distributions, capturing fine-grained user preferences. To learn the distributions, we construct a new dataset named NUANCED (Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions). NUANCED targets conversational recommendation because this type of dialogue system encourages more modeling of soft matching and implicit reasoning over user preferences. We employ professional linguists to annotate the dataset, and end up with 5.1k dialogues and 26k turns of high-quality user utterances. Our dataset captures a wide range of phenomena naturally occurring in realistic user utterances, including factoid knowledge, commonsense knowledge, and users' own situations. We conduct comprehensive experiments and analyses to demonstrate the challenges. We hope NUANCED can serve as a valuable resource to bridge the gap between current research and real-world applications.

Related Work
Task-oriented dialogue systems are typically divided into several sub-modules, including user intent detection (Liu and Lane, 2016; Gangadharaiah and Narayanaswamy, 2019), dialogue state tracking (Rastogi et al., 2017; Heck et al., 2020), dialogue policy learning (Peng et al., 2017; Su et al., 2016), and response generation (Dusek et al., 2018; Wen et al., 2015). More recent approaches begin to build unified models that bring the pipeline together (Chen et al., 2019; Hosseini-Asl et al., 2020). Conversational recommendation focuses on combining a recommendation system with online conversation to capture user preferences (Fu et al., 2020; Sun and Zhang, 2018). Previous works mostly focus on learning the agent-side policy to ask the right questions and make accurate recommendations (Xu et al., 2020; Lei et al., 2020; Li et al., 2020; Penha and Hauff, 2020). Chit-chat (Adiwardana et al., 2020; Roller et al., 2020) is the most free-form type of dialogue, but with almost no knowledge grounding or state tracking. Both existing task-oriented and conversational recommendation systems have a pre-defined system ontology as a representation connected to the back-end database; the ontology defines all entity attributes as slots and the option values for each slot. In existing datasets, such as the DSTC challenges (Williams et al., 2014), MultiWOZ (Budzianowski et al., 2018), and MGConvRex (Xu et al., 2020), the utterances from the users mostly follow the system ontology closely. While in task-oriented dialogue systems parsing the user utterances into dialogue states relies more on hard matching, in conversational recommendation systems soft matching is more encouraged, since user preferences are more salient and diverse in this type of conversation.

User Preference Modeling
Given a system ontology, denote the set of all slots as {S_i}, with the option values for each slot as {V_i^j}. Denote the current user utterance as T and the dialogue context (of past turns) as C. We model the user preference as a distribution over each slot-value, namely the preference distribution: P_i^j = P(V_i^j | T, C). Note that we expect the representation to be general, expandable, and to hold the fewest assumptions; i.e., there is no assumption on the dependency among slot-values, nor on the completeness of the value set. Therefore we model the distribution as a Bernoulli distribution over each slot-value. Intuitively, P_i^j represents the probability that the user chooses an item with attribute value V_i^j, under the observed condition of the dialogue up to the current turn. Note that the preference distributions may differ among individuals, which causes variances. In this work, we aim to aggregate estimated distributions from large-scale data collected from multiple workers as "commonsense" distributions. We leave modeling user-specific distributions to future work.
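As a minimal sketch of this state representation (the slot and value names are illustrative examples, not prescribed by the dataset), each slot maps each of its option values to an independent Bernoulli parameter P_i^j; since the values are not assumed mutually exclusive, the probabilities need not sum to 1:

```python
# Sketch of the preference-distribution state: one independent
# Bernoulli parameter P_i^j per slot-value pair.

def empty_state(ontology):
    """Initialize every P_i^j to 0.0 for a {slot: [values]} ontology."""
    return {slot: {v: 0.0 for v in values} for slot, values in ontology.items()}

def update_state(state, slot, value_probs):
    """Overwrite the distribution of one slot after a user turn."""
    for value, prob in value_probs.items():
        state[slot][value] = prob
    return state

# Hypothetical restaurant ontology fragment (values follow Figure 1).
ontology = {"alcohol": ["full_bar", "beer_and_wine", "dont_serve"]}
state = empty_state(ontology)
# e.g. an utterance implying a strong preference for cocktails, with
# beer and wine acceptable as well; note the probabilities do not sum to 1.
update_state(state, "alcohol", {"full_bar": 0.8, "beer_and_wine": 0.5})
```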

Dataset Construction
We first simulate the dialogue flow with the preference distributions, then we ask the annotators to compose utterances that imply the distribution.

Dialogue Simulator
We follow the approach of the MGConvRex dataset (Xu et al., 2020) to build user visiting histories from real-world data. For each user, whose visiting history is a list of restaurants with slot-values, we sample a subset of the history and aggregate it to get a value distribution for each slot. For example, suppose we sample two restaurants from a user's visiting history: restaurant 1 has the slot-value Alcohol = full_bar, and restaurant 2 has the slot-value Alcohol = beer_and_wine. Then the aggregated distribution is Alcohol = (full_bar, 0.5), (beer_and_wine, 0.5), (no_serve, 0.0). Generally, for the same user, the attributes of the visited restaurants tend to follow certain trends, so the distributions aggregated this way are more natural. Using the sampled distribution as the ground-truth distribution, we simulate dialogue skeletons for the following scenarios: 1) straight dialogue flow: the system asks about each slot, followed by the user response filled with preference distributions; 2) user updating preference: the user updates the preference distributions of a previous turn; 3) system yes/no questions: the system can choose to ask confirmation questions. For each turn, we randomly select 1 to 3 slots, corresponding to the cases where the user utterance naturally implies multiple slot-values. The system turns are composed using templates.
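The aggregation step above can be sketched as a relative-frequency count over the sampled subset of the history (a simplified reading of the procedure; field names are our own):

```python
from collections import Counter

# Sketch of the history-aggregation step: turn the observed values of
# one slot, across a sampled subset of visited restaurants, into a
# relative-frequency distribution over all option values.

def aggregate_slot(visited, slot, all_values):
    """visited: list of {slot: value} dicts; returns {value: prob}."""
    counts = Counter(r[slot] for r in visited if slot in r)
    total = sum(counts.values())
    return {v: counts.get(v, 0) / total if total else 0.0 for v in all_values}

# The worked example from the text: two sampled restaurants.
sampled = [{"alcohol": "full_bar"}, {"alcohol": "beer_and_wine"}]
dist = aggregate_slot(sampled, "alcohol",
                      ["full_bar", "beer_and_wine", "no_serve"])
# dist == {"full_bar": 0.5, "beer_and_wine": 0.5, "no_serve": 0.0}
```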

User Utterances Composition
After simulating the dialogue skeletons, we employ professional linguists for the composition to ensure high quality. We provide two composing strategies. Implicit reasoning: do not mention the slot-value terms explicitly. This is the focus of this work, because we expect users to be unaware of the system ontology and to describe their requests naturally. Explicit mention: use the slot-value terms (or synonyms), as a backup option when the first strategy is not applicable. We also emphasize the following aspects: 1) read the whole dialogue first to have an overall "story" in mind before composing each utterance, to ensure consistency; 2) try to compose utterances as diverse as possible; 3) reject any cases with invalid or unnatural preference distributions. We provide learning sessions for the linguists to ensure they all master the task.

Dataset Statistics and Analysis
With an average of 5.39 user turns per dialogue, we have 5,100 dialogues consisting of 25,757 user turns. The user utterances have an average length of 19.43 tokens. 84.7% of the utterances are composed using implicit reasoning, 6.5% explicitly mention the ontology terms, and the rest use mixed strategies. The train / valid / test split is 3,600 / 500 / 1,000 in the number of dialogues, and 18,182 / 2,529 / 5,046 in the number of user turns. To evaluate the quality of our dataset, we randomly sample 500 examples and ask the linguists whether each preference distribution is reasonable given the corresponding utterance. We end up with a turn-level correctness rate of 90.2%.
Among the utterances involving implicit reasoning, we summarize three basic reasoning types; examples are shown in Table 1. Type I (Factoid Knowledge) is largely agreed upon by people and is relatively stable. Type II (Commonsense Knowledge or User Situations) may not be formally defined; for example, a food item less than $10 is considered cheap. In many cases, such knowledge needs to be inferred from a situation described by the user. Type III (Mix of Type I and II) may appear in a single utterance.

NUANCED-reduced
We also provide a reduced variant called NUANCED-reduced, obtained by discretizing the preference distributions into binary labels: all slot-values with a positive preference probability are labeled 1.0, and the rest 0.0. This reduced variant has no continuous probabilities to tell the nuanced differences apart, but it still requires mapping free-form utterances to binary labels. We conduct a human evaluation by asking the annotators to decide which representation better captures fine-grained user preferences. As Table 2 shows, NUANCED better captures the nuanced information. Note that in real applications, which version of the data to use may depend on the requirements of the system, i.e., the level of granularity of the state representation.
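The discretization is straightforward; a minimal sketch (function name is our own):

```python
# Sketch of the NUANCED -> NUANCED-reduced discretization: any
# slot-value with a positive preference probability becomes 1.0,
# everything else 0.0.

def reduce_distribution(dist):
    """dist: {value: prob} -> {value: 0.0 or 1.0}."""
    return {v: 1.0 if p > 0.0 else 0.0 for v, p in dist.items()}

reduced = reduce_distribution(
    {"full_bar": 0.5, "beer_and_wine": 0.5, "no_serve": 0.0})
# reduced == {"full_bar": 1.0, "beer_and_wine": 1.0, "no_serve": 0.0}
```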

Experiments
In this section, we conduct experiments on both versions of the datasets in §4.1 and §4.2, respectively.

Baselines

Exact match & Random guess: We follow the preceding system query to make the slot prediction; we then use exact match to predict the slot-values; if no match is found, we apply a random guess.
BERT (Devlin et al., 2019): The input is the concatenation of the slot name, the current-turn system question and user utterance, and the dialogue context of past turns. We add two types of prediction heads on the [CLS] token of BERT, one for slot prediction (whether the input slot is updated or not), and one for the value prediction of each slot. The loss is a combination of cross-entropy loss for slot prediction and mean squared error (MSE) loss for value prediction. During inference, we set a threshold to decide positive or negative predictions.
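A minimal sketch of the two-head inference step (the threshold value and head outputs here are illustrative assumptions, not the paper's exact implementation):

```python
# Sketch of two-head inference: a slot head decides whether a slot is
# updated this turn; if so, the value head's scores are thresholded
# into positive/negative predictions.

THRESHOLD = 0.5  # assumed for illustration; tuned on validation in practice

def decode_turn(slot_score, value_scores, threshold=THRESHOLD):
    """slot_score: P(slot updated); value_scores: {value: score}."""
    if slot_score < threshold:
        return None  # slot not updated in this turn
    return {v: (1.0 if s >= threshold else 0.0)
            for v, s in value_scores.items()}

pred = decode_turn(0.9, {"full_bar": 0.7, "no_serve": 0.1})
# pred == {"full_bar": 1.0, "no_serve": 0.0}
```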
Transformer (Vaswani et al., 2017): We use a similar architecture to the BERT baseline but train the weights from scratch.
Train-ConvRex: As the MGConvRex dataset (Xu et al., 2020) has a similar domain and ontology, we compare the BERT model trained on MGConvRex (we contacted the first author to obtain the dataset) and tested on NUANCED-reduced. We use this baseline to demonstrate the open challenges caused by users' free-form speaking. We refer the readers to Appendix A for more details. For all baselines, we evaluate turn-level slot prediction accuracy and joint accuracy.

Results for NUANCED-reduced
As shown in Table 3, the BERT model achieves the best performance, as the external knowledge obtained from pre-training helps draw better relevance between unrecognized entities from the user and entities known to the agent. Train-ConvRex limits such mapping to the system ontology, indicating that existing dialogue datasets may limit what an agent can understand from users. Lastly, by comparing with BERT without dialogue context (i.e., past turns), we notice that context may help in learning better values but yields more noise for slot prediction.

NUANCED
Baselines

Exact match & Random guess: Similar to NUANCED-reduced, we assign a probability of 1.0 for matched values, or a random value otherwise.
BERT, Transformer: Similar to NUANCED-reduced, but we use MSE loss between the ground-truth and the predicted distribution.
Train-reduced-X: We train the model on NUANCED-reduced and test it on NUANCED, to see how well a model trained with binary states can infer states in the continuous space. We assign a fixed number X as the continuous value for all positive predictions, and experiment with X = 0.5 and X = 1.0.
We keep the same evaluation for slot prediction. For value prediction, we evaluate the average mean absolute error (MAE) between the ground-truth distribution and the predictions. As shown in Table 4, BERT reaches the best performance. One interesting observation is that, with the same BERT model, the slot prediction accuracy increases (from 88.21% to 89.62%) compared with training on the reduced version: NUANCED helps to reduce the noise of sparse entities in the context (past turns), probably because numbers in a continuous space can draw more relevance among different entities. Train-reduced-X has a much larger error, indicating that simply adapting the results from the reduced state labels suffers from information loss, i.e., the nuanced differences in continuous distributions.
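The value metric can be sketched as follows (a simplified reading: per the appendix, only the top-3 values of each slot enter the MAE; the function name is our own):

```python
# Sketch of the value-prediction metric: mean absolute error between
# the ground-truth and predicted distributions over a slot's values,
# restricted to the top-k ground-truth values.

def slot_mae(gold, pred, top_k=3):
    """gold, pred: {value: prob}; average |gold - pred| over top-k gold values."""
    top = sorted(gold, key=gold.get, reverse=True)[:top_k]
    return sum(abs(gold[v] - pred.get(v, 0.0)) for v in top) / len(top)

err = slot_mae({"full_bar": 0.5, "beer_and_wine": 0.5, "no_serve": 0.0},
               {"full_bar": 0.4, "beer_and_wine": 0.6, "no_serve": 0.0})
# err == (0.1 + 0.1 + 0.0) / 3 ≈ 0.0667
```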

Analysis on Slots
We study how the models perform on different kinds of turns, shown in Table 5. Generally speaking, the turns with more slots are relatively harder to learn. The turns that update the preference of a previous turn have the highest error, since the preference distribution needs to be jointly inferred from the previous mention and the current turn. We also study the performance on each slot in Appendix B, and provide some case studies in Appendix C.

Human Evaluation
We further conduct a human evaluation on the baseline models. We first evaluate the model outputs of Transformer, BERT, and BERT w/o context, through pairwise comparison between the model predictions and the gold labels. The results on 200 samples are shown in Table 6. There is a large gap between the best-performing baseline and the gold reference, which indicates significant room for improvement for future research. Further, we study the breakdown of BERT's predictions across the three types of reasoning. As shown in Table 7, the Type I utterances, which involve factoid knowledge, are relatively harder to learn. This matches our intuition, because factoid knowledge is vast (and keeps growing) and the limited utterances in the dataset may not cover all of it.

Conclusion and Open Problems
Starting from our dataset, we believe the user-centric dialogue system is an open-ended problem, and the following directions are worth pursuing: 1) Preliminary experimental results indicate that, to improve performance, it is promising to incorporate external domain texts into pre-trained models, for example by pre-training the model on domain corpora such as restaurant descriptions and reviews. 2) Although our dataset collects a large set of domain entity knowledge, we still cannot guarantee that it will cover the vast number of unknown entities in the future. One idea is to incorporate a knowledge base (KB) in the form of data augmentation or modeling. 3) Although one can learn a general agreement of estimated distributions from the crowd through our large-scale dataset, a more user-specific distribution would be more desirable. We believe providing a personalized solution is another proper next step to consider.

A Implementation Details

Figure 2 presents the architecture of the BERT baseline. For each turn, we concatenate each slot with the current turn and the dialogue context as the input. On the [CLS] output, we add one head for slot prediction as binary classification, i.e., whether the input slot is updated in the current turn. For each slot, we add a specific head for value prediction. We use cross-entropy loss for slot prediction, and mean squared error loss for value distribution prediction. The overall loss is a weighted combination of the two losses; we set the weight for value prediction to 20.0. The threshold for value prediction in NUANCED-reduced is set to 0.5. We use the BERT-base uncased model from the official release with 110M parameters; the learning rate is set to 3e-5 and the batch size to 32. We select checkpoints based on performance on the validation set. For NUANCED-reduced, training takes around 25,000 gradient steps; for NUANCED, around 40,000 steps. For the Transformer model, to achieve the best performance we use 6 layers and a hidden size of 300.
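The combined objective described above can be sketched for a single turn as follows (a simplified scalar version; the actual training operates on batched model logits):

```python
import math

# Sketch of the combined training objective: binary cross-entropy on
# the slot head plus a weighted MSE on the value head.

VALUE_WEIGHT = 20.0  # weight for value prediction, per the text

def turn_loss(slot_prob, slot_label, value_pred, value_gold):
    """slot_prob: predicted P(slot updated); value_*: {value: prob}."""
    eps = 1e-12  # numerical stability inside the logs
    ce = -(slot_label * math.log(slot_prob + eps)
           + (1 - slot_label) * math.log(1 - slot_prob + eps))
    mse = sum((value_pred[v] - value_gold[v]) ** 2
              for v in value_gold) / len(value_gold)
    return ce + VALUE_WEIGHT * mse
```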
All training is done on a single NVIDIA TESLA M40 card with 11G memory.
Note that for the slot "food category", some values are commonly observed in the dataset, such as "American food" and "nightlife", while others are less frequent, such as "Thai". During training we employ up-sampling for the less frequent ones.
In the construction of NUANCED, we sample a subset of the user history and aggregate it to get the ground-truth preference distributions. Because the number of viable values differs per slot, and for slots with relatively more values the distribution generally presents a "long tail", we only take the top 3 value distributions for each slot. Correspondingly, during model evaluation we also take the top 3 predicted value distributions to calculate the MAE.

B Analysis on Slots
We also study how the model performs on each slot in the domain, shown in Table 8. Generally, slots that may involve more factoid knowledge or more choices of values, such as food category and parking, are harder to learn. These may require learning long-tailed knowledge from external data.

Table 9: Some case studies. The last example shows two turns in a dialogue and the corresponding distributions for each turn; the user updates in a later turn the preference expressed in a previous turn.