“What do others think?”: Task-Oriented Conversational Modeling with Subjective Knowledge

Task-oriented Dialogue (TOD) systems aim to assist users in accomplishing specific goals, such as booking a hotel or a restaurant. Traditional TOD systems rely on domain-specific APIs/DBs or external factual knowledge to generate responses, and thus cannot accommodate subjective user requests (e.g., "Is the WIFI reliable?" or "Does the restaurant have a good atmosphere?"). To address this issue, we propose a novel task of subjective-knowledge-based TOD (SK-TOD). We also propose the first corresponding dataset, which contains subjective knowledge-seeking dialogue contexts and manually annotated responses grounded in subjective knowledge sources. When evaluated with existing TOD approaches, we find that this task poses new challenges, such as aggregating diverse opinions from multiple knowledge snippets. We hope this task and dataset can promote further research on TOD and subjective content understanding. The code and the dataset are available at https://github.com/alexa/dstc11-track5.


Introduction
Task-oriented Dialogue (TOD) systems aim to assist users in accomplishing specific goals, such as booking a hotel or a restaurant. Most TOD solutions are based on domain APIs (Budzianowski et al., 2018; Rastogi et al., 2020) and structured databases (Eric et al., 2017; Wu et al., 2019), which can only handle a limited range of scenarios within the scope of the APIs/DBs. To further extend the model's ability to provide task-oriented assistance, recent works (Dimitrakis et al., 2018; Kim et al., 2020, 2021; Feng et al., 2020, 2021; Majumder et al., 2022) incorporate unstructured textual information retrieved from the Internet into dialogue modeling. Most of these works focus on factual knowledge sources such as frequently asked questions (FAQs) of online products or government service guides. We refer to these models as Fact-TOD models.

Figure 1: Examples of the SK-TOD task. The top part shows two hotels and their customer reviews. The bottom part shows three dialogue sessions between the system (denoted by S) and three users (denoted by U). The last user utterance is a subjective question about the WIFI quality of the hotel(s). The system needs to retrieve information from the relevant subjective knowledge, which is highlighted in the review text.
However, in many TOD tasks, users care not only about factual information but also about subjective insights, such as the experiences, opinions, and preferences of other customers. For instance, when booking a hotel or a restaurant, users often inquire about subjective aspects like "Is the WIFI reliable?" or "Does the restaurant have a good atmosphere?".
To respond to such user requests, an agent needs to seek information from subjective knowledge sources, such as online customer reviews. While subjective knowledge has been specifically studied in other NLP problems such as opinion mining (Liu and Zhang, 2012) and question answering (Bjerva et al., 2020), incorporating it into TOD has not received significant attention.
In this work, we argue that it is important to enable TOD models to leverage subjective knowledge for more effective task-oriented assistance. To this end, we propose a novel task of subjective-knowledge-based task-oriented dialogue (SK-TOD). SK-TOD focuses on responding to user requests that seek subjective information by incorporating user reviews as subjective knowledge. Figure 1 shows three examples of such requests, where customers ask about the WiFi quality of various hotels. User reviews are valuable resources for subjective information because, even for the same aspect of a product or service, customers may have different opinions and leave either positive or negative reviews. As a result, a TOD system should consider multiple reviews to provide a comprehensive representation of user opinions. Ideally, the system's response should include both positive and negative opinions, along with their respective proportions (as exemplified in Dialogue 3). Such two-sided responses have been recognized as more credible and valuable for customers (Kamins et al., 1989; Lee et al., 2008; Baek et al., 2012), thereby fostering trust in the TOD system.
Incorporating subjective knowledge into TOD introduces two unique challenges. First, unlike in Fact-TOD, where selecting a few relevant knowledge snippets suffices, the SK-TOD model must consider all relevant knowledge snippets; in other words, both precision and recall matter during this process. Second, the model needs to aggregate these knowledge snippets into a concise response that faithfully reflects the diversity and proportion of the opinions expressed. Conquering these challenges requires a large-scale dataset with subjective-knowledge-grounded responses, which, to the best of our knowledge, is not publicly available.
To facilitate research in subjective-knowledge-grounded TOD, we have collected a large-scale dataset, which contains 19,696 subjective knowledge-seeking dialogue contexts and manually annotated responses that are grounded on 143 entities and 1,430 reviews (8,013 sentences). We evaluate the performance of strong baselines on the SK-TOD task. Results show that there is a large gap between human-generated and machine-generated responses, particularly in terms of the faithfulness of the sentiment proportion. To address this issue, we propose a model that incorporates review understanding into SK-TOD. We experimentally demonstrate that responses generated by this model more effectively capture the sentiment proportion. Our contributions are three-fold:
• We introduce a novel task of subjective-knowledge-based TOD (SK-TOD);
• we create and release a large-scale, manually annotated dataset for this task;
• we provide strong baselines and analyses that highlight the unique challenges of the task.

Related Work

Knowledge-Grounded Dialogue

Knowledge-grounded dialogue has explored a wide range of knowledge sources, from structured knowledge such as tables (et al., 2018; Liu et al., 2018) and knowledge graphs (Zhang et al., 2020a; Moon et al., 2019; Tuan et al., 2019) to unstructured knowledge such as Wikipedia articles (Vougiouklis et al., 2016; Zhou et al., 2018; Dinan et al., 2018), news articles (Majumder et al., 2020), web pages (Long et al., 2017; Galley et al., 2019; Komeili et al., 2022), narratives (Xu et al., 2021; Gopalakrishnan et al., 2019), user reviews and comments (Moghe et al., 2018; Ghazvininejad et al., 2018), and so on. Grounding on external knowledge makes responses more informative and meaningful than those generated from the dialogue context alone.
In task-oriented dialogue, most works rely on domain-specific APIs and databases to support the dialogue response (Levin et al., 2000; Singh et al., 2002; Williams and Young, 2007; Eric et al., 2017; Wu et al., 2019), which can only support a limited scope of user queries. Later works ground task-oriented dialogues in web pages (Penha et al., 2019; Chen et al., 2022), government service documents (Saeidi et al., 2018; Feng et al., 2020, 2021), and FAQ knowledge snippets (Kim et al., 2020, 2021). Different from these works, which utilize factual knowledge, we apply subjective knowledge to generate the response and ground it in multiple knowledge snippets. While Majumder et al. (2022) also ground TOD on user reviews, they did not consider the diversity of opinions.

Table 1: Comparison between SK-TOD and other benchmarks based on subjective content. We consider whether the dataset is manually annotated, dialogue-based, task-oriented, and query-focused. We also list whether it considers aspect and sentiment, multiple knowledge snippets (Mul-Knwl), and the proportion of two-sided sentiments (Senti-%).

Subjective Content Understanding
Besides being used as an external knowledge source in dialogue systems, subjective content, especially user reviews, has been studied in several non-conversational NLP tasks. For example, opinion mining (Pontiki et al., 2016; Jiang et al., 2019) aims to extract opinions and sentiments from user reviews. Opinion summarization (Chu and Liu, 2019; Angelidis et al., 2021; Bražinskas et al., 2020) distills multiple opinions into a concise summary. Subjective question answering (McAuley and Yang, 2016; Bjerva et al., 2020) answers questions based on user reviews. Explainable recommendation (Ni et al., 2019) generates review-based explanations for the items proposed by a recommendation system. Table 1 provides detailed comparisons between SK-TOD and these subjective-content-based benchmarks. Generally, SK-TOD requires creating a response that is appropriate to the dialogue context. It also requires grounding in multiple subjective knowledge snippets and explicitly considering the diversity of opinions and the proportion of sentiments.

Problem Formulation
Formally, we have a dialogue context C = {U_1, S_1, U_2, S_2, ..., U_t} between a user and a system, where each user utterance U_i is followed by a system response utterance S_i except the last user utterance U_t. The dialogue involves one or more entities represented as E = {e_1, ..., e_m}. Along with the dialogue, we have a subjective knowledge source B = {(e_1, R_1), (e_2, R_2), ...} consisting of all the entities and their corresponding customer reviews. Each entity e has multiple reviews R = {r_1, r_2, ...}, whose sentences serve as candidate knowledge snippets. Given C and B, the task is to generate a response S_t that addresses the user's subjective request, grounded in the relevant knowledge snippets.

Data Collection and Statistics
We ground the data collection in MultiWOZ (Budzianowski et al., 2018; Eric et al., 2020). We select dialogues from the hotel and restaurant domains. The data collection is conducted by a group of crowd workers through Amazon Mechanical Turk. To control data quality, we only choose workers that are pre-qualified. More details can be found in Appendix A.

Annotation Guideline
Dialogues in MultiWOZ are collected based on single or multiple entities as the back-end database. To create a subjective knowledge source to support the SK-TOD task, we first collect multiple user reviews for each entity. To control the review collection, we provide workers with the reviewer's persona, as well as the aspects and sentiments of the reviews. We then ask workers to write a review that includes all of the given information. After collecting the reviews, we also annotate the aspect and sentiment information for each review sentence. Overall, we select 33 hotels and 110 restaurants from MultiWOZ and collect 10 reviews for each entity. Each review contains 5.6 sentences and 56.71 tokens on average. More details about the review collection can be found in Appendix A.
After collecting reviews, we return to the dialogue data to create the subjective user requests. Following the procedure of Kim et al. (2020), for each dialogue, we provide an aspect that users are interested in (e.g., the WIFI quality of the hotel) and then ask the worker to insert a subjective user request into the dialogue. Workers are asked to carefully select the insertion position and write an utterance that maintains the coherence and naturalness of the entire dialogue flow. Finally, we use the partial dialogue up to this newly inserted turn as an instance in our data; utterances after the insertion position are removed from the dialogue instance. So far, we have collected the dialogue context C and the subjective knowledge source B. The last step is to ground the dialogue in the knowledge source. We first ask workers to identify the entities that are relevant to the subjective user request as gold entities. We then align the user request with reviews of the gold entities by matching the aspect of the user request against that of the review sentences. For example, if the aspect of a user request is the "WIFI quality" of a hotel, all review sentences of this hotel with "WIFI quality" as the aspect become relevant knowledge snippets. Finally, we provide the dialogue context C and all relevant knowledge snippets K+ and ask workers to create a natural and faithful response. We explicitly ask workers to consider the diversity and proportion of opinions across all relevant knowledge snippets during response creation. Instructions can be found in Appendix A.
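The aspect-matching step above can be sketched in a few lines. This is a minimal illustration, not the released annotation tooling; the function name and the toy aspect-labeled reviews are assumptions for the example.

```python
# Sketch of grounding a subjective user request in review sentences by
# matching the request's aspect against each sentence's annotated aspect.
# The data structures and names here are illustrative.

def relevant_snippets(request_aspect, entity_reviews):
    """Return all review sentences whose annotated aspect matches the
    aspect of the subjective user request (the gold set K+)."""
    matched = []
    for review_id, sentences in entity_reviews.items():
        for sent_id, (text, aspect) in enumerate(sentences):
            if aspect == request_aspect:
                matched.append((review_id, sent_id, text))
    return matched

# Toy annotated reviews for one hotel: each sentence carries an aspect label.
reviews = {
    "rev1": [("The wifi kept dropping every hour.", "wifi-quality"),
             ("Breakfast was excellent.", "food")],
    "rev2": [("Internet was fast and stable.", "wifi-quality")],
}

k_plus = relevant_snippets("wifi-quality", reviews)
# Both wifi-related sentences are selected as the gold knowledge set K+.
```

Because aspect labels are annotated per review sentence, this matching is exact rather than fuzzy, which is what makes the gold knowledge sets reliable.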

Data Statistics
We collected 19,696 instances with subjective user requests and subjective-knowledge-grounded responses in total. The average length of the subjective user request and the agent response is 8.75 and 24.07 tokens, respectively. While most of the instances contain a single entity, there are 1,047 instances where multiple entities are compared (like Dialogue 2 in Figure 1). Each instance requires 3.88 subjective knowledge snippets on average. To help identify subjective knowledge-seeking user requests, we randomly sample another 18,383 dialogues with non-subjective user requests from the original MultiWOZ dataset.
We split the dataset into training (75%), validation (10.8%), and test (14.2%) sets. Table 2 shows the detailed statistics of each subset. Our validation and test sets each contain two subsets: a seen subset, whose aspects are included in the training set, and an unseen subset, whose aspects are not. The unseen subset is designed to evaluate models' generalizability to arbitrary aspects.
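A seen/unseen split of this kind can be constructed by holding out a subset of aspects entirely. The sketch below is an illustration of the idea, not the dataset's actual split script; the aspect names and instance format are assumptions.

```python
# Sketch of a seen/unseen aspect split: instances whose aspect is held
# out never share an aspect with the training data, so the unseen
# subset tests generalization to new aspects.

def split_by_aspect(instances, unseen_aspects):
    train_pool, unseen = [], []
    for inst in instances:
        (unseen if inst["aspect"] in unseen_aspects else train_pool).append(inst)
    return train_pool, unseen

instances = [
    {"id": 1, "aspect": "wifi-quality"},
    {"id": 2, "aspect": "parking"},
    {"id": 3, "aspect": "pet-policy"},
]
train_pool, unseen = split_by_aspect(instances, {"pet-policy"})
# Instances 1 and 2 can be divided among train/validation/"seen" test;
# instance 3 may only appear in the "unseen" evaluation subsets.
```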

Subjective-Knowledge-Grounded TOD
In this section, we describe our method for SK-TOD. As shown in Figure 2, we follow the pipeline of Kim et al. (2020), which consists of four sequential sub-tasks: knowledge-seeking turn detection (KTD), entity tracking (ET), knowledge selection (KS), and response generation (RG). The details of each subtask are described as follows.

Knowledge-Seeking Turn Detection
The goal of KTD is to identify user requests that need to be addressed with subjective knowledge. We regard it as a binary classification problem, where the input is the dialogue context C and the output is a binary indicator.
We employ a pre-trained language model (e.g., BERT (Devlin et al., 2019)) to encode C and adopt the hidden state of the first token as its representation h_C. Then we apply a classifier to obtain the probability that the current user request is a subjective knowledge-seeking request. That is,

p(y = 1 | C) = σ(W h_C + b).   (1)

The model is finetuned with the binary cross-entropy loss.
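Numerically, Eq. (1) reduces to a sigmoid over a linear function of the first-token representation. The sketch below illustrates this with toy values; the encoder itself is abstracted away, and h_c, w, and b are made-up numbers rather than learned parameters.

```python
import math

# Minimal numerical sketch of Eq. (1): a binary classifier on top of the
# first-token representation h_C produced by a pre-trained encoder.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ktd_probability(h_c, w, b):
    """p(y = 1 | C) = sigma(w . h_C + b): the probability that the last
    user turn is a subjective knowledge-seeking request."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h_c)) + b)

def bce_loss(p, y):
    """Binary cross-entropy used to finetune the model."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

h_c = [0.2, -0.5, 0.1, 0.7]   # stand-in for the [CLS] hidden state
w = [0.3, -0.2, 0.5, 0.1]     # toy classifier weights
b = 0.0
p = ktd_probability(h_c, w, b)
loss = bce_loss(p, y=1.0)
```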

Entity Tracking
The goal of ET is to identify the entities E = {e_1, ..., e_m} that are relevant to the user request. It helps to reduce the number of candidates in the knowledge selection step.
We adopt a word-matching-based method used by Jin et al. (2021) to extract relevant entities. It first normalizes entity names in the knowledge source using a set of heuristic rules. Then a fuzzy n-gram matching is performed between the normalized entity names and all dialogue turns. To find the entities that are relevant to the last user request, we choose the last dialogue turn in which entities are detected and use these entities as the output. We leave the tracking of aspects questioned over multiple turns as future work.
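The normalize-then-fuzzy-match procedure can be sketched as follows. This is a simplified illustration, not the exact rules of Jin et al. (2021): the normalization regex and the 0.85 similarity threshold are assumptions for the example.

```python
import re
from difflib import SequenceMatcher

# Sketch of word-matching-based entity tracking: normalize entity names,
# fuzzily match them against each dialogue turn, and keep the entities
# found in the last turn where any entity is detected.

def normalize(name):
    name = name.lower()
    name = re.sub(r"\b(the|hotel|restaurant)\b", " ", name)  # toy rules
    return " ".join(name.split())

def turn_mentions(entity, turn, threshold=0.85):
    ent = normalize(entity)
    words = normalize(turn).split()
    n = len(ent.split())
    for i in range(len(words) - n + 1):
        ngram = " ".join(words[i:i + n])
        if SequenceMatcher(None, ent, ngram).ratio() >= threshold:
            return True
    return False

def track_entities(entities, turns):
    """Return entities detected in the last turn where any entity appears."""
    for turn in reversed(turns):
        found = [e for e in entities if turn_mentions(e, turn)]
        if found:
            return found
    return []

turns = ["I need a cheap hotel.",
         "How about Cityroomz?",
         "Does the cityroomz have good wifi?"]
found = track_entities(["Cityroomz", "Acorn Guest House"], turns)
```

Scanning turns in reverse mirrors the heuristic in the text: the most recently mentioned entities are taken as the ones the last user request refers to.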

Knowledge Selection
The goal of KS is to select the knowledge snippets that are relevant to the user's request. The inputs are the dialogue context C and the knowledge snippet candidates K, which is the union of all knowledge snippets of the relevant entities in E. The output K+ ⊆ K is the subset of relevant knowledge candidates. Note that K+ may contain multiple knowledge snippets.
To select relevant knowledge snippets, we calculate a relevance score between the dialogue context C and each knowledge snippet K ∈ K. We regard this as a pairwise text scoring problem and consider two popular approaches: the bi-encoder (Mazaré et al., 2018) and the cross-encoder (Wolf et al., 2019). Generally, the bi-encoder approach is more efficient, while the cross-encoder approach is more accurate.
For the bi-encoder approach, we encode C and K separately using the same pre-trained encoder and obtain two representations, h_C and h_K. Following Reimers and Gurevych (2019), we use the concatenation of h_C, h_K, and |h_C − h_K| as features and apply a classifier to obtain the probability of relevance. That is,

p(y = 1 | C, K) = σ(W [h_C; h_K; |h_C − h_K|] + b).   (2)

For the cross-encoder approach, we instead encode the concatenation of C and K to obtain a contextualized representation h_{C,K} and apply the classifier on top of it. That is,

p(y = 1 | C, K) = σ(W h_{C,K} + b).   (3)

During training, we use all relevant knowledge snippets to construct positive (C, K) pairs. Since the set of irrelevant knowledge snippets is large, we randomly sample the same number of irrelevant snippets to build negative pairs. We optimize the model with the binary cross-entropy loss. During inference, we predict the relevance probability of all candidate knowledge snippets. Since both precision and recall matter in KS, instead of selecting the top few results, we use a relevance threshold to determine relevance, which is estimated on the validation set.
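The bi-encoder features of Eq. (2) and the threshold estimation can be sketched numerically. The encoders are replaced by toy fixed vectors, and the candidate threshold grid is an illustrative assumption; in practice the scores come from the trained model and the threshold is tuned on the validation set.

```python
import math

# Sketch of Eq. (2)-style scoring plus F1-maximizing threshold selection,
# reflecting that both precision and recall matter in KS.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relevance_prob(h_c, h_k, w, b):
    # features: [h_C ; h_K ; |h_C - h_K|]
    feats = h_c + h_k + [abs(c - k) for c, k in zip(h_c, h_k)]
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feats)) + b)

def f1(preds, golds):
    tp = sum(1 for p, g in zip(preds, golds) if p and g)
    prec = tp / max(sum(preds), 1)
    rec = tp / max(sum(golds), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)

def pick_threshold(scores, golds):
    grid = [i / 20 for i in range(2, 19)]   # 0.10, 0.15, ..., 0.90
    return max(grid, key=lambda t: f1([s >= t for s in scores], golds))

scores = [0.92, 0.85, 0.40, 0.10]   # toy validation relevance scores
golds = [1, 1, 0, 0]
threshold = pick_threshold(scores, golds)
selected = [s >= threshold for s in scores]
```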

Response Generation
The goal of RG is to create an utterance S_t that responds to the user's request based on the dialogue context C and the relevant knowledge snippets K+. We concatenate K+ and C as the input and use a pre-trained generation model to create the response. We consider both decoder-only models (such as GPT-2 (Radford et al.)) and encoder-decoder models (such as BART (Lewis et al., 2020)). The model is trained to maximize the generation probability p(S_t | C, K+).
To faithfully reflect the diversity and proportion of opinions, the model needs to understand the sentiment polarity of each knowledge snippet, which is challenging due to the lack of direct supervision. To address this issue, we apply a state-of-the-art aspect-based sentiment analysis (ABSA) model (Zhang et al., 2021) to predict the sentiment z_i of each knowledge snippet K_i, yielding Z. We then incorporate the sentiment information into RG by maximizing p(S_t | C, K+, Z). More specifically, we first convert each predicted z_i into a natural language description using templates and then append it to the end of the corresponding K_i as the enhanced input of RG. For example, given the knowledge snippet "The ambience was so fun.", the ABSA model detects the aspect-based sentiment as ("ambience", "positive"). We convert this into the natural language description "ambience is great." and enhance the snippet as "The ambience was so fun. ambience is great.". We refer to Appendix B for more details.
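The template step can be sketched directly; the mapping of positive/neutral/negative to "great"/"ok"/"bad" follows the description in Appendix B, while the ABSA model itself is replaced here by a pre-computed prediction.

```python
# Sketch of appending a templated sentiment description to a knowledge
# snippet before response generation. The ABSA prediction is assumed to
# be given; only the template logic is shown.

SENTIMENT_WORD = {"positive": "great", "neutral": "ok", "negative": "bad"}

def enhance_snippet(snippet, aspect, polarity):
    """Append the natural-language sentiment description to the snippet."""
    return f"{snippet} {aspect} is {SENTIMENT_WORD[polarity]}."

enhanced = enhance_snippet("The ambience was so fun.", "ambience", "positive")
# -> "The ambience was so fun. ambience is great."
```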

Experiments on Sub-Tasks
We first conduct experiments on each individual subtask. To avoid error accumulation from upstream tasks, we use the gold output of the previous task as the input to each target task. The detailed experimental setup can be found in Appendix C.
Knowledge-Seeking Turn Detection

Evaluation We report the precision, recall, F1 score, and accuracy score.
Results Table 3 shows the results of the KTD task. All models achieve similar and near-perfect performance, in line with the findings of Kim et al. (2020). This indicates that it is feasible to identify user requests that require subjective knowledge, which can then be explicitly handled by an SK-TOD component. However, this KTD classifier may only perform well on this dataset or similar ones; its generalizability to unseen domains or knowledge types needs to be explored in future work.

Entity Tracking
Setting We follow the setting of Jin et al. (2021) to run the ET method.
Evaluation We report the instance-level accuracy score. An instance is regarded as accurate only if the predicted entities are the same as the gold entities.
Results The fuzzy n-gram matching method achieves an instance-level accuracy of 92.18%. We further analyzed the type of errors. For 1.8% of the instances, there is at least one gold entity that is missing from the predicted entities. For 7.6% of the instances, the predicted entities contain at least one spurious entity. The latter error case can be further reduced by using model-based matching approaches, which we leave as future work.

Knowledge Selection
Setting We follow the setting of KTD to finetune the KS models. We also compare them with traditional information-retrieval (IR) baselines.

Evaluation Knowledge selection can be regarded as either a classification task or a retrieval task. For classification, we use precision, recall, and F1 measures, calculated at both the instance level and the snippet level. At the instance level, we first calculate P/R/F1 for each instance and then average over all instances. At the snippet level, instead of calculating P/R/F1 per instance, we calculate P/R/F1 over all <C, K> pairs in the entire dataset. For retrieval, we use mean average precision (mAP) as the metric, which is insensitive to the threshold value and reflects the overall ranking positions of all relevant knowledge snippets. Since the total number of relevant knowledge snippets varies per instance, we do not include top-K-based measures such as Precision@K or Recall@K, which are commonly used in other Fact-TOD and knowledge-grounded open-domain dialogue problems.
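The instance-level measures and mAP can be sketched concretely. This is a minimal illustration of the metric definitions above, with toy predictions; it is not the evaluation script used for the reported numbers.

```python
# Sketch of the KS metrics: instance-level F1 (averaged over instances)
# and average precision over a ranked snippet list.

def prf1(pred, gold):
    """Precision/recall/F1 for one instance's predicted and gold sets."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def instance_level_f1(instances):
    """Average the per-instance F1 of (pred_set, gold_set) pairs."""
    return sum(prf1(p, g)[2] for p, g in instances) / len(instances)

def average_precision(scores, golds):
    """AP over a ranked list: mean precision at each relevant rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if golds[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(sum(golds), 1)

instances = [({"k1", "k2"}, {"k1", "k2"}), ({"k3"}, {"k3", "k4"})]
# instance 1: F1 = 1.0; instance 2: P = 1, R = 0.5, F1 = 2/3
ap = average_precision([0.9, 0.8, 0.2], [1, 0, 1])
```

mAP is then the mean of `average_precision` over all instances, which is why it does not depend on any relevance threshold.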
Results Table 4 shows the results of KS. First, comparing our trained models with the IR baselines, all trained models outperform the baselines, indicating that the KS model benefits from the annotated training data. We then compare bi-encoder and cross-encoder models. As expected, cross-encoder models outperform bi-encoder models by a large margin. Comparing different pre-trained models, there is a large difference among models under the bi-encoder setting; the variance becomes smaller with the cross-encoder architecture. DeBERTa achieves the best performance on all measures in both the bi-encoder and cross-encoder settings. Finally, we compare performance between the seen and unseen subsets. At the bottom of Table 4, we list the performance of DeBERTa on the seen and unseen test subsets. There is a large gap between the two, indicating that one of the challenges for the KS model is to generalize from seen aspects to unseen aspects.

Response Generation
Setting We tried GPT-2 (Radford et al.) and DialoGPT (Zhang et al., 2020c), two decoder-only generation models, as well as BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), two encoder-decoder models. We also include BART_ABSA and T5_ABSA, two ABSA-enhanced models. During decoding, we use beam search with top-K sampling (Fan et al., 2018), with a beam size of 5 and sampling from the top 50 tokens. We compare with a random extractive baseline (EXT), where the response is created by randomly selecting a relevant knowledge snippet. We use the base version of all pre-trained models.

Evaluation Following the evaluation of other generation tasks, we employ BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and BERTScore (Zhang et al., 2020b) to evaluate results against the reference responses. We also conduct a human evaluation, where we ask crowd workers to assess the appropriateness, aspect accuracy, and sentiment accuracy of generated responses.

Results As shown in Table 5, machine-generated responses are significantly better than the extractive baseline. Encoder-decoder models achieve better performance on all automatic measures compared with GPT-based models, indicating that they are better suited for this task. They also tend to generate longer responses. There is no apparent difference in automatic measures between BART and T5. BART_ABSA achieves the best performance on BERTScore, while T5_ABSA achieves the best scores on BLEU and ROUGE.

Model      BLEU   R-1    R-2    R-L    MT     BS     Len
EXT        2.89   23.17  6.53   18.33  9.62   30.83  14.93
GPT2       9.04   33.9   13.52  26.73  16.27  39.73  22.66
DialoGPT   9.19   33.6   13.62  26.81  16.15  39.72  22.05
BART       10.8   36.35  15.04  28.57  17.96  41.12  24.02
BART_ABSA  10.78  36.30  15.36  28.47  18.06  41.75  23.66
T5         10.72  36.50  15.57  28.81  18.33  40.84  25.36
T5_ABSA    10.97  36.66  15.51  28.88  18.15  40.94  24.75

Table 5: Results of the RG task. Models are evaluated using BLEU, ROUGE (R-1, R-2, R-L), METEOR (MT), and BERTScore (BS). We also list the average length (Len) of the generated response. Encoder-decoder models such as BART and T5 achieve better performance compared with GPT2-based models.

Human Evaluation Since automatic measures are not always reliable, we also conduct a human evaluation on AMT to better assess the quality of the responses. We use the same group of workers as in the data collection. During the evaluation, we show the dialogue context, the oracle knowledge snippets, and all responses (both the reference and the generated responses) to the workers. We sample 240 instances from the test set for evaluation. For each instance, we ask three independent workers to compare the responses according to Appropriateness, Aspect Accuracy, and Sentiment Accuracy. We use the average score over all instances and workers as the system-level score. More details can be found in Appendix D.
As shown in Table 6, machine-generated responses are similar to the references in appropriateness and aspect accuracy. It also shows that incorporating ABSA improves model performance on sentiment accuracy. However, there is still a large gap in sentiment accuracy between the best model-generated responses and the references, indicating that it is challenging for current models to faithfully aggregate sentiment information from multiple knowledge snippets.

Qualitative Analysis Table 7 shows an example of responses generated by various systems:

U: Are there any hotels (not guesthouses) that have 0 stars?
S: Yes, actually. Cityroomz is a moderately priced 0 star hotel.
U: Do they have free wifi?
S: Yes, they do have free wifi! Would you like me to book a stay for you?
U: Does the Cityroomz have strong water pressure in the shower?
Review 1: However, I noticed that when taking a shower there was very little water pressure which made the shower disappointing.
Review 2: The water pressure is not good and it comes out really slow.
Review 3: Everything else was a disaster, especially the lack of cleanliness in the room when we checked in, and the incredibly low water pressure in the bathroom when we took our morning showers.
Review 4: I encounter a problem when trying to shower and washing hands.
Review 5: There is no luggage storage service or anything like that, and the water pressure is much to be desired.
GPT-2: The Cityroomz gets some mixed reviews on their water pressure. Some guests found the water pressure to be very good while others found it to be disappointing. Would you like to know more about them?
BART: The Cityroomz hotel has very low water pressure according to most guests. Only one person had a good experience with it.
BART_ABSA: The Cityroomz has a lot of complaints about their water pressure being a bit low. Would you like me to find another place with better ratings?
T5: The Cityroomz has mixed reviews about their water pressure. Some guests found the water pressure to be very low while others found it to be disappointing. Would you like to know more about them?
T5_ABSA: The Cityroomz has a lot of reviews that say the water pressure is very low and disappointing. Do you want to look at some other places?
Reference: No, guests consistently complain about the water pressure, unfortunately. Will that be okay or should I do another search?

While all reviews express negative opinions about the water pressure, the responses generated by GPT-2 and BART include positive opinions. T5 correctly mentions the negative opinions, but the generated response is not natural and coherent. With the help of the ABSA model, both BART and T5 correctly generate responses with all-negative opinions.

Table 8: Results of the end-to-end evaluation. We start from RG with gold knowledge as input. We then gradually add components (KS, ET, and KTD) to the pipeline to replace the gold input with the predicted one.

Experiments on End-2-End Evaluation
In Section 6, we use the gold information as input for each module to avoid error accumulation. In this section, we evaluate the entire pipeline in an end-to-end manner, where the input of each subtask is predicted by the previous component. We gradually add KS, ET, and KTD to the pipeline, and list the performance of KS and RG in Table 8. The results show that errors introduced during KS decrease the quality of response generation. However, ET and KTD do not have much impact on the performance of downstream tasks, because the ET and KTD results contain less noisy predictions than the KS results.

Comparison with Fact-TOD

One difference between SK-TOD and Fact-TOD is that responses in SK-TOD are grounded on subjective knowledge instead of factual knowledge. In this section, we investigate whether a Fact-TOD model can ground on subjective knowledge to address subjective requests. To this end, we re-train our KTD (DeBERTa), KS (DeBERTa cross-encoder), and RG (BART) models on the FAQ-grounded TOD data provided by Kim et al. (2020) and then apply them to the test set of SK-TOD without further training. We compare the results of each sub-task with those of models trained on SK-TOD.
As shown in Table 9, for all tasks there is a large performance gap between models trained on Fact-TOD data and those trained on SK-TOD data. By inspecting the model output, we further observe that the Fact-TOD model tends to ground on and copy from only a single knowledge snippet. This indicates that it is difficult to apply a Fact-TOD model to the SK-TOD task directly. It also demonstrates that, compared with Fact-TOD, SK-TOD poses new challenges of subjective content understanding and dialogue modeling when incorporating subjective knowledge. We provide sampled outputs and more discussion in Appendix E.

Conclusion
In this paper, we have introduced SK-TOD, a novel task focused on subjective-knowledge-based task-oriented dialogue response generation. We create and release a large-scale, manually annotated dataset for this task. Incorporating subjective knowledge requires models to accurately identify all relevant knowledge snippets and faithfully aggregate the information into concise responses, which brings unique challenges. Experiments with strong baselines show a large performance gap between human- and machine-generated responses, particularly in faithfully capturing the diversity and proportion of opinions. We hope this task, together with the dataset, can promote future research on knowledge-grounded TOD and subjective content understanding.

Limitations
The dataset we collected covers two domains, restaurants and hotels. To evaluate a model's ability to generalize across domains, it would be beneficial to include more domains in the dataset. Additionally, to address privacy and copyright concerns, we used crowd-sourcing to collect review data, resulting in fewer and shorter reviews than those found in real-world scenarios. This limitation could be mitigated by sampling informative and reliable reviews from real-world data. Regarding the model, we did not investigate more complex models, such as large language models or novel architectures. However, we provide a strong baseline method that can serve as a benchmark for more advanced methods from the research community.

Ethical Considerations
To build our dataset, we collected the dialogue data by augmenting MultiWOZ 2.1, which is a publicly available English dialogue dataset under the MIT license. Additionally, we collected the review data using crowd-sourcing, where we provided crowd workers with the reviewer's persona, as well as the aspects and sentiments of reviews. This controlled review collection process helps to exclude offensive or harmful content from the reviews. It also helps to avoid privacy or copyright issues when making the dataset publicly available. Our dataset is available under the CDLA-Sharing 1.0 license. We do not condone the unauthorized use of copyrighted content or web crawling for review collection. We expect anyone using our system to comply with local copyright laws and respect the rights of copyright holders when applying our approach to real-world text.
When annotating responses, we provided multiple subjective knowledge snippets and explicitly asked crowd workers to incorporate the diversity and proportion of all snippets into the response. This approach helps to prevent bias toward majority opinions or randomly selected opinions, which has been overlooked by many other subjective-content-based benchmarks. Please refer to Table 1 for further details.
To ensure the quality of our dataset, we took great care in selecting pre-qualified workers and designing annotation interfaces. We further conducted a human verification task on the entire dataset to identify invalid instances. The annotation showed that 81.89% of subjective-knowledge-seeking user turns are valid (IAA: 0.9369 in Gwet's gamma), and 96.78% of agent response turns are valid (IAA: 0.9497 in Gwet's gamma). Any invalid instances were filtered out or manually corrected before finalizing the dataset. Our data annotation and verification processes adhere to ethical guidelines set forth by ARR, ACL, and ACM. We paid workers an average of $13.82/hr for data annotation and $14.77/hr for data verification. Both exceed the local living minimum wage. The details of our payment settings are elaborated on in the appendix.
For human evaluation, we asked pre-qualified crowd workers to assess the appropriateness, aspect accuracy, and sentiment accuracy of generated responses. Detailed instructions and interfaces can be found in the appendix. The IAA scores for the three tasks were 0.7270, 0.7535, and 0.6239 in Gwet's gamma, respectively. We paid our crowd workers an average of $15.25/hr, $14.40/hr, and $14.85/hr for the three evaluation tasks, which exceeds the local living minimum wage. More details about our payment settings are provided in the appendix.
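The agreement figures above are Gwet's gamma (AC1). For two raters giving binary valid/invalid judgments, the statistic reduces to a few lines; the sketch below is illustrative and is not our evaluation code:

```python
def gwet_ac1(r1, r2):
    """Gwet's AC1 (gamma) agreement for two raters on binary labels (0/1).

    pa: observed agreement; pe: chance agreement 2*pi*(1-pi),
    where pi is the mean proportion of items labeled 1 by the two raters.
    """
    n = len(r1)
    pa = sum(a == b for a, b in zip(r1, r2)) / n
    pi = (sum(r1) + sum(r2)) / (2 * n)
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)
```

Since pe is at most 0.5 for binary labels, the denominator never vanishes, which is one reason AC1 behaves better than Cohen's kappa under skewed label distributions.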

B Aspect Based Sentiment Analysis
To enhance the model's ability to understand the sentiment polarity of each individual knowledge snippet, we apply PGEN (Zhang et al., 2021), a state-of-the-art aspect-based sentiment analysis model, to predict the sentiment of each snippet. PGEN converts aspect-based sentiment analysis into a sequence generation problem, where the input is the review sentence and the output is a natural language description of the aspect and the sentiment. For example, given the review sentence "The ambience was so fun.", where the aspect term is "ambience" and the corresponding sentiment polarity is "positive", PGEN transforms the aspect term and the sentiment polarity into the natural language description "ambience is great." using templates. The transformation keeps the aspect term unchanged and maps the positive/neutral/negative sentiment polarities to one of three tokens: "great", "ok", and "bad". The model is trained using a BART-base model on the SemEval aspect-based sentiment analysis datasets (Pontiki et al., 2015, 2016).
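The template step above can be sketched in a few lines; the mapping mirrors the description in the text, while the function name is illustrative:

```python
# Map ABSA polarity labels to the three template tokens described above.
SENTIMENT_WORDS = {"positive": "great", "neutral": "ok", "negative": "bad"}

def verbalize(aspect_term: str, polarity: str) -> str:
    """Turn an (aspect, polarity) pair into a natural-language target
    sentence, keeping the aspect term unchanged."""
    return f"{aspect_term} is {SENTIMENT_WORDS[polarity]}."
```

For the running example, `verbalize("ambience", "positive")` yields `"ambience is great."`, which serves as the generation target for the sequence-to-sequence model.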

C Training Details
For KTD and KS, the implementation is based on Transformers (Wolf et al., 2020). During training, we use AdamW (Loshchilov and Hutter, 2018) with a learning rate of 3 × 10⁻⁵ and a batch size of 16. We apply warmup (Goyal et al., 2017) on the first 500 steps and early stopping based on the model performance on the validation set. We use a Tesla V100 GPU with 16 GB memory for training models. It takes 1 hour to train a KTD model and 5 hours to train a KS model.
During inference, we set the classification threshold to 0 for KTD, as we observe that KTD results are insensitive to the threshold. For the KS model, however, the threshold greatly impacts the precision and recall scores. We therefore choose the best threshold based on the F1 score on the validation set, using a grid search from -5 to 5. The optimal thresholds for BERT, RoBERTa, ALBERT, and DeBERTa are 2.25, 1, 1.75, and 2 in the bi-encoder setting, and 3.1, 4.6, 3.25, and 3.4 in the cross-encoder setting.
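The threshold search can be sketched as follows; the helper names and the grid step are illustrative, not our actual tuning code:

```python
def f1(preds, golds):
    """Binary F1 from boolean prediction and gold lists."""
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(g and not p for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_threshold(scores, golds, lo=-5.0, hi=5.0, step=0.05):
    """Grid-search the decision threshold that maximizes validation F1."""
    best_t, best_f1 = lo, -1.0
    t = lo
    while t <= hi:
        preds = [s > t for s in scores]
        cur = f1(preds, golds)
        if cur > best_f1:
            best_t, best_f1 = t, cur
        t = round(t + step, 10)  # avoid float drift across iterations
    return best_t, best_f1
```

The same search is run once per encoder and per setting (bi- vs. cross-encoder), which is why each model ends up with its own optimal threshold.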
For the ET model, we follow the setting of Jin et al. (2021) to identify entities. More specifically, we perform fuzzy n-gram matching between an entity and the utterance, where n equals the length of the entity mention. The n-gram matching score is calculated as the ratio of the longest common sequence between the two n-grams. We set the matching threshold to 0.95.
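A minimal sketch of this matching follows. We use difflib's `SequenceMatcher` ratio as a stand-in for the longest-common-sequence score, which is an assumption; the function name and whitespace tokenization are also illustrative:

```python
from difflib import SequenceMatcher

def entity_in_utterance(entity: str, utterance: str,
                        threshold: float = 0.95) -> bool:
    """Slide a window of len(entity-mention) tokens over the utterance and
    fuzzily compare each window against the entity mention."""
    ent_tokens = entity.lower().split()
    utt_tokens = utterance.lower().split()
    n = len(ent_tokens)
    ent_text = " ".join(ent_tokens)
    for i in range(len(utt_tokens) - n + 1):
        window = " ".join(utt_tokens[i:i + n])
        # Ratio based on longest matching subsequences of the two strings.
        if SequenceMatcher(None, ent_text, window).ratio() >= threshold:
            return True
    return False
```

With the high 0.95 threshold, near-exact mentions (e.g. minor casing or punctuation differences) still match, while unrelated spans do not.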
For the RG model, during training we use AdamW with a learning rate of 3 × 10⁻⁵ and a batch size of 16. We apply warmup on the first 500 steps and early stopping based on the model performance (perplexity) on the development set. The model is trained on a Tesla V100 GPU with 16 GB memory for 2 hours.

D Human Evaluation
We ask workers to compare these responses according to the following three measures:
• Appropriateness: whether the response is fluent and naturally connected to the dialogue context.
• Aspect Accuracy: whether the response provides relevant and useful information to the aspect that the user queried.
• Sentiment Accuracy: whether the sentiment proportion provided by the response is accordant with that of the subjective knowledge.
For Appropriateness and Aspect Accuracy, we only show the dialogue context and the responses. For Sentiment Accuracy, we further show the oracle knowledge snippets. To increase the annotation quality, we first ask workers to annotate the sentiment label of each knowledge snippet and then evaluate the sentiment accuracy of each response. All three measures are evaluated on a 5-point Likert scale. We pay $0.25 for the Appropriateness and Aspect Accuracy tasks and $0.40 for the Sentiment Accuracy task. Figure 6 shows the interface for evaluating Sentiment Accuracy; the interfaces for Appropriateness and Aspect Accuracy are similar.

E Comparison with Fact-TOD
In Tables 10 and 11, we provide responses generated by the Fact-TOD model and the SK-TOD model, respectively. The responses generated by Fact-TOD ground on only a single knowledge snippet, indicating that the Fact-TOD model lacks the ability to aggregate information from multiple knowledge snippets, especially when there are diverse and contradictory opinions. We also find that Fact-TOD tends to copy from knowledge snippets but fails to distinguish between facts and opinions. This is because, when grounding on factual knowledge, it is sufficient to ground on a few knowledge snippets and copy from the factual knowledge text. However, when grounding on subjective knowledge, it is essential to aggregate information from multiple snippets and handle diverse and contradictory opinions.

Table 10: The first sampled output from the Fact-TOD model and the SK-TOD model.

U: Hi! I'm looking for a place to stay with 3 stars that includes wifi.
S: I can help you with that. I have several options that fit this criteria. Do you prefer a particular area of town?
U: Just near Cambridge, and should be a guesthouse.
S: Bridge guest house is a guesthouse in the south in the moderate price range with free wifi and parking. Would you like to reserve a room?
U: Certainly. There will be 5 of us for 4 nights starting on Tuesday.
S: I was able to book for 5 people for 4 nights starting on Tuesday. The reference number is 97VS1TDE. Is there anything else I can help you with?
U: I prefer a big room so do you know how big the room is?
Review 1: The room was smaller than expected and I was surprised that they didn't have any luggage storage service for the amount we paid.
Review 2: Our room was bright and cheerful with lots of space and a newly refurbished bathroom.
Fact-TOD: The Bridge Guest House's room is smaller than expected and I was surprised that they don't have any luggage storage services. Do you need anything else?
SK-TOD: According to the reviews I have on hand for that location, the opinions are mixed. Half found the rooms smaller than expected and half found them bright and cheerful with lots of space and a newly refurbished bathroom.

Table 11: The second sampled output from the Fact-TOD model and the SK-TOD model.

U: I am looking for a guesthouse to stay at that has free wifi.
S: Ok. Were you hoping for a particular price range or part of town?
U: I don't care about the part of town or the price range, but it needs to be a three star hotel.
S: The Hamilton Lodge would be perfect! It is a moderately priced 3 star guesthouse on the north side of town. Does that sound okay to you?
U: Does it have comfortable beds?
Review 1: The bed really needed a new mattress, too.
Review 2: That being said, the room was spacious and the bed was very comfortable.
Review 3: We both slept extremely well every night we were there.
Fact-TOD: Yes, the Hamilton Lodge has a new mattress. Would you like to make a reservation?
SK-TOD: The Hamilton Lodge has really comfortable beds according to most guests, but one guest did say that the bed needed a new mattress.
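The aggregation behavior that SK-TOD needs (and Fact-TOD lacks) can be illustrated with a minimal sketch that counts snippet sentiments and verbalizes their proportion; the cases and wording below are illustrative, not the model's actual mechanism:

```python
def summarize_opinions(sentiments):
    """Given per-snippet sentiment labels, verbalize their proportion.
    Neutral snippets are ignored in this toy sketch."""
    pos = sum(s == "positive" for s in sentiments)
    neg = sum(s == "negative" for s in sentiments)
    total = pos + neg
    if total == 0:
        return "no opinions found"
    if pos == total:
        return "all guests were positive"
    if neg == total:
        return "all guests were negative"
    if pos == neg:
        return "opinions are mixed, half positive and half negative"
    majority = "positive" if pos > neg else "negative"
    return f"most guests were {majority}, though some disagreed"
```

For the Table 10 reviews (one negative, one positive) this yields the "mixed, half and half" framing of the SK-TOD response, whereas for Table 11 (two positive, one negative) it yields the "most guests, but one disagreed" framing; a single-snippet Fact-TOD response can express neither.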

Figure 2: The pipeline architecture of SK-TOD.

Figure 3: The interface of review collection.

Figure 4: The interface of user request collection.

Figure 5: The interface of response generation.

Figure 6: The interface of evaluating the sentiment accuracy of responses.

Table 2: Basic statistics of our dataset.

Table 3: Results of the KTD task. Models are evaluated using Accuracy, Precision, Recall, and F1. All models achieve similar and near-perfect performance.

Table 4: Results of the KS task. Models are evaluated using instance-level and snippet-level classification measures, as well as mAP, a retrieval-based measure. DeBERTa achieves the best performance among all evaluation measures.

Table 6: Results of human evaluation for RG.

Table 7: Sampled output of different RG models.

Table 9: Comparison between models trained on Fact-TOD and SK-TOD training data.

Table 11: The second sampled output from the Fact-TOD model and the SK-TOD model.