Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions

Enabling open-domain dialogue systems to ask clarifying questions when appropriate is an important direction for improving the quality of system responses. Namely, when a user request is not specific enough for a conversational system to provide an answer right away, it is desirable to ask a clarifying question to increase the chances of retrieving a satisfying answer. To address the problem of asking clarifying questions in open-domain dialogues: (1) we collect and release a new dataset focused on open-domain single- and multi-turn conversations; (2) we benchmark several state-of-the-art neural baselines; and (3) we propose a pipeline consisting of offline and online steps for evaluating the quality of clarifying questions in various dialogues. These contributions are suitable as a foundation for further research.


Introduction
The ultimate goal of a conversational system is to assist users by returning an appropriate answer in response to their requests (Kiseleva et al., 2016a; Li et al., 2021). Recent progress on neural approaches to natural language processing (Liu et al., 2019; Clark et al., 2020) and the availability of large amounts of conversational data have triggered a renaissance in end-to-end neural open-domain chatbots (Adiwardana et al., 2020; Roller et al., 2021; Zhang et al., 2020; Burtsev et al., 2017). There has been great progress on suggesting measures to evaluate what makes a conversation satisfying for users using various human evaluation techniques (Li et al., 2019a; See et al., 2019b). Those efforts showed that the suggested large pre-trained models do not always perform seamlessly (See et al., 2019a), and there are still several challenges that need to be solved for open-domain conversational systems (Huang et al., 2020).
[Figure 1: Example conversations with clarifying questions; (b) and (c) demonstrate a situation when the request is ambiguous and a system needs to act.]

Nass and Moon (2000) conclude that people have similar expectations from talking to bots and humans. This similarity is a possible explanation for why user requests are sometimes ambiguous and incomplete, as shown in Fig. 1 (b) and (c). This ambiguity is especially challenging to handle in a dialogue setting, where a system is limited to returning only one answer in response to each request, unlike in a web search setup, where diversification of results is possible and acceptable (Vallet and Castells, 2012). Previous research has shown that users are much more forgiving about system mistakes if they can act on them with minimal effort (Kocielnik et al., 2019; Kiseleva et al., 2016b). Therefore, in cases of user request ambiguity, it is more appropriate to ask a clarifying question than to generate an incorrect answer. There have been separate attempts to explore the following related tasks: (1) identifying the moment when a question should be asked in the course of a conversation (Hancock et al., 2019); and (2) retrieving a clarification question (Rao and Daumé III, 2018; Wang et al., 2018). In this paper, we aim to combine these related aspects and study the following problem of generating clarifying questions for open-domain conversations: the system must identify whether the request is ambiguous and, if so, then instead of trying to answer it directly, it should ask a good clarifying question (Fig. 1).
One possible stumbling block preventing the community from studying the problem of open-domain clarifying question generation to enhance user experience while interacting with a conversational bot (Huang et al., 2020) is the lack of suitable datasets, which we address in this work. To summarise, the main contributions of this work are: C1 releasing a dataset dedicated to the problem of asking a clarifying question in open-domain dialogue systems. The dataset includes single-turn (∼15K) and multi-turn (∼1.5M) conversations, covers ∼300 topics, and is suited to study: (1) when a clarifying question should be asked given the current context of the conversation; and (2) which question should be asked; C2 benchmarking several state-of-the-art (SoTA) neural models; and C3 building an evaluation pipeline that provides fast iteration and involves two stages: (1) offline (automatic evaluation); and (2) online (human-in-the-loop conversing with the system).1 We release the collected dataset, the offline evaluation pipeline, and the code for running the explored neural SoTA models, which can be employed as baselines for the task.2

Related work
Our work is broadly relevant to two strands of research: learning to ask clarifying questions in open-domain conversational settings (Section 2.1) and evaluating dialogue systems (Section 2.2).

Learning to ask clarifying questions
The information retrieval community has paid close attention to the problem of ambiguity in user search queries. Previously, this problem was addressed through the diversification of search result pages (Radlinski and Dumais, 2006; Allan, 2016, 2014), including via the usage of personal and contextual data (Jiang et al., 2015; Kato and Tanaka, 2016). Recently, Rosset et al. (2020), Aliannejadi et al. (2019), and Zamani et al. (2020a) suggested techniques to address ambiguity by generating clarifying questions.

1 The pipeline was designed as part of the ConvAI3 (Aliannejadi et al., 2020) data challenge (https://convai.io)
2 Available at https://github.com/aliannejadi/ClariQ
The general setting is as follows: (1) a user issues an ambiguous keyword query; and (2) the search engine's goal is to suggest conversational clarifying questions to help find the required information (Krasakis et al., 2020; Lotze et al., 2021; Sekulic et al., 2021; Aliannejadi et al., 2021). These works have also resulted in a number of datasets, e.g., Qulac (Aliannejadi et al., 2019) and MIMICS (Zamani et al., 2020b), which consist of queries issued by real users and behavioral signals such as clicks. Braslavski et al. (2017) focus on the characteristics, forms, and general patterns of clarifying questions.
Suggesting a clarifying question is closely related to the question answering (Q&A) (Kwiatkowski et al., 2019; Soleimani et al., 2021) and question generation (QG) domains (Gao et al., 2019; Chai and Wan, 2020). Trienes and Balog (2019) made an attempt to understand unclear questions, and Li et al. (2017) suggested an RL-based method for deciding when to ask for user feedback in a Q&A setup.
Recently, proactive bot behavior has started to attract researchers' attention in dialogue settings, yet it remains rather untouched (Huang et al., 2020). Rao and Daumé III (2018) designed a model to rank a candidate set of clarification questions by their usefulness to a given post on Stack Exchange, targeting the problem of which question to ask. The resulting dataset was released, but it covers specific, narrow topics. In contrast, Hancock et al. (2019) focused on when to ask a question in order to self-retrain a bot, which also resulted in a released dataset. Wang et al. (2018) studied QG techniques applied to open-domain conversations.

Evaluating Dialogue Systems
Dialogue systems are generally separated into two types: task-oriented and open-domain. Task-oriented systems usually have clear criteria for evaluation, e.g., turn correction ratio, inappropriate utterance ratio, proxies for accuracy, and success rate (Takanobu et al., 2019; Li et al., 2016; Su et al., 2018; Li et al., 2020). Despite significant efforts to introduce automatic metrics to evaluate open-domain conversations (Reiter, 2018; Novikova et al., 2017; Lowe et al., 2017), this remains an area for exploration (Li et al., 2019a,b, 2021). To the best of our knowledge, the current standard approach for evaluating open-domain dialogues requires employing human assessments via crowdsourcing platforms (Zhang et al., 2018; Li et al., 2019a) or engaging volunteers to participate in research competitions (Burtsev et al., 2018; Dinan et al., 2020; Burtsev and Logacheva, 2020).
Therefore, we can conclude that understanding and generating open-domain clarification questions is a major component of conversational information-seeking systems that is still under exploration. Hence, our efforts on collecting datasets and investigating the performance of neural SoTA models are timely and useful for future research in this area.

Problem Setting
Our main goal is to collect a dataset to enable studying the generation of clarifying questions whenever appropriate, as depicted in the examples in Fig. 1. Fig. 2 demonstrates a pipeline that makes it possible to process user requests in the open domain as follows: 'User Request Understanding' (URU) decides which module to call: either 'Clarifying Question Generation' (CQG) or 'Answer Generation' (AG). In this work, we focus on the first two. We aim to collect the following data:
• User Request (U): an initial user request in conversational form, e.g., 'What is Fickle Creek Farm?', with a label reflecting whether clarification is needed;
• Set of clarifying questions ({Q}): a set of reasonable clarifying questions that address multiple aspects/facets of U, e.g., Q1: 'Do you want to know the location of fickle creek farm?', Q2: 'Would you like to know the history of fickle creek farm?';
• User answers ({A}): each clarifying question is supplied with a user answer, e.g., the answer to Q1 is A1: 'No, I want to find out where can I purchase fickle creek farm products', and the answer to Q2 is A2: 'I just need general information about fickle creek'.
The collected dataset of pairs (U, {Q, A}) can be easily transformed into a set of single-turn conversations consisting of coherent and consistent triples (U, Q, A), as shown in the example in Fig. 2. We require that the items in a triple, U, Q, and A, satisfy the following requirements:
R1 user requests (U) must cover various conversational topics to represent open-domain dialogues;
R2 the final collection of U should contain both types: ambiguous and unambiguous;
R3 each inquiry U to the system should be in conversational form;
R4 the need for clarification should be predetermined as a label for each U in the collection;
R5 the clarifying questions (Q) should be reasonable, coherent with U, and together address the multiple facets of each ambiguous request U; and
R6 each user answer A should be consistent with the clarifying question from the system.
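The flattening of (U, {Q, A}) pairs into single-turn triples described above can be sketched as follows. This is an illustrative sketch only; the class and function names (`ClarificationRequest`, `to_single_turn`) are our own and not part of the released dataset's API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ClarificationRequest:
    """One collected item: an initial user request U with its
    clarification-need label (R4), candidate questions {Q} (R5),
    and the matching user answers {A} (R6)."""
    request: str                # conversational-form request (R3)
    needs_clarification: bool   # ambiguous vs. unambiguous (R2)
    questions: List[str]
    answers: List[str]          # answers[i] answers questions[i]

def to_single_turn(item: ClarificationRequest) -> List[Tuple[str, str, str]]:
    """Flatten (U, {Q, A}) into coherent single-turn triples (U, Q, A)."""
    return [(item.request, q, a) for q, a in zip(item.questions, item.answers)]

item = ClarificationRequest(
    request="What is Fickle Creek Farm?",
    needs_clarification=True,
    questions=["Do you want to know the location of fickle creek farm?",
               "Would you like to know the history of fickle creek farm?"],
    answers=["No, I want to find out where can I purchase fickle creek farm products",
             "I just need general information about fickle creek"],
)
triples = to_single_turn(item)
```

Each resulting triple is one self-contained single-turn conversation, which is the unit used for training the agents in Sec. 5.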
After collecting the single-turn conversations, they are used to train various conversational agents. To collect multi-turn conversations, the two best-performing agents are utilized to converse with crowdsourced workers, who evaluate the system's quality and reply to the suggested clarifying questions. Finally, the two agents are evaluated using the Acute-eval framework (Li et al., 2019a), which is the best available practice for online evaluation of open-domain dialogue systems. Overall, our pipeline for data collection and evaluation is summarized in Fig. 3.

Data Collection
Following the suggested pipeline in Fig. 3, we significantly extended the initial collection by crowdsourcing more data through Human Intelligence Tasks (HITs) on Amazon Mechanical Turk7, whose design follows the general strategy proposed in (Aliannejadi et al., 2019). Namely, we asked the workers to imagine themselves acting as a conversational agent8 that an imaginary user had asked about a topic. Then, we described the concept of a facet to them, supporting it with multiple examples. Finally, we asked Turkers to do the following:
• discover the facets of each U using a preferred search engine, scanning the results in the first three pages; and
• generate six questions related to U, aiming to address the facets they had figured out.
We assigned two workers per HIT, resulting in 12 questions per U in the first round. To preserve the questions' language diversity, we limited each worker to a maximum of two HITs. HITs were available to workers residing in the U.S. who had an approval rate of over 97%.

Controlling Quality of Clarifying Questions
To estimate the quality of the collected questions, we aim to address two main concerns: (1) how good are the collected clarifying questions?; and (2) is the set of clarifying questions diverse (in other words, does it address the different facets associated with the topic)? Given the high complexity of this task, we appointed two expert annotators. They were instructed to read all the collected questions on each topic, marking invalid and duplicate questions. Annotators were asked to match a question to a facet if its answer would address the facet. Finally, to ensure that all facets were covered by at least one question, we asked the annotators to generate an additional question for each facet that needed more specific questions.
Collecting Answers
To satisfy R6, we designed another HIT to collect coherent and consistent answers to the clarifying questions. The task started with detailed instructions followed by several examples. The workers were given U and a facet description. We then instructed them to assume that they had submitted the initial user request U with their actual information need being the given facet. The workers were then required to write an answer to the one clarifying question that was presented to them. If a question required information other than what the workers were provided with, they were instructed to use a 'No answer' tag. Each worker was allowed to complete a maximum of 100 HITs to ensure language diversity. Workers were based in the U.S. with an approval rate of 95% or greater.

Controlling Quality of Collected Answers
During the course of data collection, we performed regular quality checks on the collected answers. The checks were done manually on 10% of the submissions per worker. If we observed any invalid submissions among the sampled answers of a worker, we then studied all the submissions from that worker. Invalid submissions were removed from the collection, and the worker was banned. Finally, we assigned all invalid answers to other workers to complete. Moreover, we employed basic behavioral check techniques in the design of the HIT. For example, we disabled the copy/paste features of text inputs and tracked workers' keystrokes. This enabled us to detect and reject low-quality submissions.
As an outcome, we have a high-quality collection of single-turn conversations in the form of the required triples (U, Q, A), which is marked as P1 in our pipeline in Fig. 3. Tab. 3 provides statistics on the collected dataset of single-turn conversations.

P3: Crowdsourcing Multi-Turn Dialogues
The collected dataset of single-turn conversations is sufficient to train and evaluate several conversational agents. More technical details on training and evaluation are provided in Sec. 5. For now, we assume that, as a result of P2, DA is one of the best-performing trained dialogue agents. We assume that the trained DA can hold a conversation with users. Namely, at each dialogue step it should either ask a clarifying question or give a factual answer to the user's request. Therefore, the trained DA is capable of:
• providing a clarifying question whenever appropriate in the course of the conversation; and
• interpreting the user's answer to the clarifying question.
To collect multi-turn conversations, we utilize the best-performing dialogue agents that can accommodate an arbitrary number of turns, having two goals in mind: G1 evaluating the quality of the agents in multi-turn settings where they converse with real humans; and G2 collecting a new dataset of multi-turn conversations with respect to clarification questions. To reach G1, the idea is to run the agents multiple times with different turns and evaluate them accordingly. For that purpose, we designed a HIT similar to the ones described in Sec. 4.1, differing in the context of the conversation. We instructed crowd workers to understand the user's actual information need, imagine they were looking for the same information, and then follow a conversation on that topic as presented in Fig. 4. The workers were instructed to answer the last question in the conversation while considering the conversation's context, which consists of previous questions and answers. The context of a conversation could include 1-2 rounds of question-answer interactions. To check the quality of the clarifying questions returned by the trained dialogue agents, we instructed workers to indicate if a question was not understandable or was not in proper language; such questions were removed from the collection. We used the same quality-check procedure for the collected answers as described in the previous section. Tab. 3 provides statistics on the collected multi-turn conversations, achieving G2.

[Figure 4: An example of the task provided in the HIT for multi-turn conversations, which asks to submit the answer given the dialogue history/context.]

Synthetic Multi-Turn Conversations
We also generate synthetic multi-turn conversations for training purposes. To do so, for each topic, we create the set of all possible combinations of questions (2 or 3 questions) together with their corresponding answers.
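The synthetic-conversation construction just described can be sketched with `itertools.combinations`. This is a minimal illustration under our reading of the text (every 2- or 3-question subset of a topic's question-answer pairs becomes one conversation); the function name and conversation encoding are ours, not the released code's.

```python
from itertools import combinations
from typing import List, Tuple

def synthesize_multi_turn(request: str,
                          qa_pairs: List[Tuple[str, str]],
                          turns=(2, 3)) -> List[List[str]]:
    """Build synthetic multi-turn conversations for one topic by combining
    2 or 3 (question, answer) pairs after the initial request."""
    conversations = []
    for k in turns:
        for combo in combinations(qa_pairs, k):
            context = [request]
            for question, answer in combo:
                context.extend([question, answer])
            conversations.append(context)
    return conversations

qa = [("Q1?", "A1"), ("Q2?", "A2"), ("Q3?", "A3")]
convs = synthesize_multi_turn("Tell me about the source of the Nile", qa)
```

With three question-answer pairs this yields C(3,2) + C(3,3) = 4 synthetic conversations, so the multi-turn training set grows combinatorially with the number of questions per topic.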

Models and Evaluation
Following the suggested pipeline in Fig. 3, we explain our contributions regarding the evaluation of the 'asking clarifying questions in open-domain dialogues' problem, namely:
• P2: how the dialogue agents are trained and automatically evaluated based on single-turn conversations (Sec. 5.1); and
• P4: how the evaluation of multi-turn conversations is performed from both perspectives: in an offline automatic manner and with a human in the loop using Acute-eval (Li et al., 2019a) (Sec. 5.2).
We design our experiments to collect answers to the following research questions:
RQ1 When should clarifying questions be asked during open-domain dialogues? (Sec. 5.1.1)
RQ2 Which clarifying question should be asked for a given context of a conversation? (a. the single-turn case is described in Sec. 5.1.2; b. the multi-turn case in Sec. 5.2)

P2: Evaluating Single-Turn Agents
The collected dataset, described in Sec. 4.1, is split into training (70%), validation (dev) (10%), and test (20%) sets. We split the data based on the search topic and maintained the same split for all single-turn and multi-turn experiments. During the evaluation procedure, the following is used: (1) a set of conversational user requests, and (2) a set of questions (i.e., question bank), which contains all collected questions on all the topics.
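The topic-based split described above can be sketched as follows. This is an assumption-laden illustration (the seed, the exact shuffling procedure, and the function name `split_by_topic` are ours); the key property it demonstrates is that every topic lands in exactly one of the train/dev/test sets, so no topic leaks across splits.

```python
import random

def split_by_topic(topic_ids, seed=0, train_frac=0.7, dev_frac=0.1):
    """Partition topic IDs into train/dev/test (70/10/20 by default),
    so that all conversations on one topic stay in one split."""
    topics = sorted(topic_ids)
    random.Random(seed).shuffle(topics)   # deterministic shuffle
    n = len(topics)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return (topics[:n_train],
            topics[n_train:n_train + n_dev],
            topics[n_train + n_dev:])

train_t, dev_t, test_t = split_by_topic(range(300))
```

Because the same topic partition is reused for the multi-turn experiments, models are always tested on topics unseen during training.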

Predicting Clarification Need
Task
The task is, given a user request, to return a score from 1 (no need for clarifying questions) to 4 (cannot provide any answer without user clarification) indicating the necessity of asking clarifying questions (as depicted in the URU module in Fig. 2).

Automatic Evaluation
To evaluate the performance of the suggested classifiers, we use Precision, Recall, F1-measure, and Mean Squared Error (MSE). Tab. 4 presents the collected results of various classification methods, which include a RoBERTa-based classifier (Liu et al., 2019), BART (Lewis et al., 2020), and a BERT-based classifier (Devlin et al., 2019). Based on the supplied results, we can answer RQ1: the task is rather difficult and can potentially benefit from more exploration, despite the reasonable performance of the proposed baselines.
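The metrics above can be computed as in the sketch below. Note one assumption on our part: the paper reports Precision/Recall/F1 alongside MSE over the 1-4 scores, so we illustrate binarizing the scores at a threshold (score >= 2 meaning "ask a clarifying question"); the threshold and the function name are ours.

```python
def clarification_need_metrics(y_true, y_pred, threshold=2):
    """MSE over 1-4 clarification-need scores, plus Precision/Recall/F1
    after binarizing at `threshold` (an illustrative choice)."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    true_bin = [t >= threshold for t in y_true]
    pred_bin = [p >= threshold for p in y_pred]
    tp = sum(t and p for t, p in zip(true_bin, pred_bin))
    precision = tp / sum(pred_bin) if sum(pred_bin) else 0.0
    recall = tp / sum(true_bin) if sum(true_bin) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return mse, precision, recall, f1

# Toy example: gold scores vs. a classifier's predicted scores.
mse, p, r, f1 = clarification_need_metrics([1, 4, 3, 2], [1, 3, 3, 1])
```

MSE penalizes being one level off on the 1-4 scale, while the binarized F1 captures the practically important decision: whether to ask at all.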

Returning Clarifying Question
Task
The task is, given a user request that needs clarification, to return the most suitable clarifying question from the supplied question bank (as shown in the CQG module in Fig. 2).
Automatic Evaluation We introduce two main strategies for evaluation: (1) document relevance and (2) question relevance.

Document Relevance
To estimate the relevance of the retrieved documents, we use the following standard metrics: Mean Reciprocal Rank (MRR) (Voorhees, 1999; Radev et al., 2002), Precision (P)@[1,3,5,10,20], and Normalized Discounted Cumulative Gain (nDCG)@[1,3,5,20] (Wang et al., 2013). These metrics are computed as follows: a selected clarifying question, together with its corresponding answer, is added to the original user request. The updated query is then used to retrieve (or re-rank) documents from the collection. The quality of the question is then evaluated by measuring how much the question and its answer affect document retrieval performance when added to the initial request. We evaluate document relevance based on the relevance assessments provided by the TREC Web Track.
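The query-expansion step and the MRR metric described above can be sketched as follows (the retrieval engine itself is out of scope here; function names and the simple whitespace concatenation are our own illustrative choices):

```python
def expand_request(request: str, question: str, answer: str) -> str:
    """Form the updated query: the selected clarifying question and its
    answer are appended to the original user request."""
    return " ".join([request, question, answer])

def mean_reciprocal_rank(ranked_lists, relevant_sets) -> float:
    """MRR over queries: the average reciprocal rank of the first
    relevant document in each ranking."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break   # only the first relevant hit counts
    return total / len(ranked_lists)

query = expand_request("Tell me about the source of the Nile",
                       "Are you referring to the Nile river?",
                       "No, the board game called the source of the Nile")
# Two toy queries: first relevant doc at rank 2 and rank 3 respectively.
mrr = mean_reciprocal_rank([["d2", "d1"], ["d5", "d4", "d3"]],
                           [{"d1"}, {"d3"}])
```

A good clarifying question thus shows up indirectly: adding it (with its answer) to the request should push relevant documents toward the top of the ranking.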
Question Relevance
Models are also evaluated on how well they can rank relevant questions higher than other questions in the question bank. For this task, which we call 'question relevance', the models are evaluated in terms of Recall@[10,20,30]. Since the precision of the models is evaluated in the document relevance task, here we focus only on recall.
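Recall@k for the question-relevance task can be computed as in this small sketch (function name ours): the fraction of a request's relevant questions that appear in the top k of the ranking drawn from the question bank.

```python
def recall_at_k(ranked_questions, relevant_questions, k: int) -> float:
    """Fraction of the relevant questions that appear in the top-k
    of the model's ranking over the question bank."""
    top_k = set(ranked_questions[:k])
    return len(top_k & relevant_questions) / len(relevant_questions)

# Toy ranking over a 5-question bank; 3 questions are relevant.
ranking = ["q3", "q1", "q7", "q2", "q9"]
r_at_3 = recall_at_k(ranking, {"q1", "q2", "q4"}, k=3)
r_at_5 = recall_at_k(ranking, {"q1", "q2", "q4"}, k=5)
```

Recall rises monotonically with k, which is why the paper reports cutoffs at 10, 20, and 30.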
The suggested evaluation metrics are collected for a number of baselines (B) and fine-tuned state-of-the-art NLP models (M). The results of these methods for single-turn conversations are reported in Tab. 5 and Tab. 6. To answer RQ2.a, we can conclude that the performance of the best-performing fine-tuned neural SoTA models is reasonable, and we can use them for multi-turn conversations.

P4: Evaluating Multi-Turn Conversations
Task
The task is, given an ongoing conversation with multiple turns, to select or generate the next question that would best clarify the user's intent. The main goal is to learn from previous user feedback and ask a question that would lead to the highest information gain.

Automatic Evaluation
Similar to the single-turn task, we evaluate the effectiveness of the baseline models based on document relevance. Therefore, we utilize the whole conversation context, clarifying questions, and human responses to retrieve documents from the collection and assess the quality of a question based on its impact on ranking performance. Note that we do not evaluate multi-turn models in terms of question relevance, since question relevance is intended to evaluate the recall of questions related to the search topic. Due to the complexity and cost of the evaluation, we pick the two best-performing models from Sec. 5.1.2 for this task. To do so, we use the synthetic training data to fine-tune ELECTRA and RoBERTa similarly to our single-turn setup, but in the multi-turn case the whole conversation history is considered as the user request. The module for deciding whether the request needs clarification is preserved. We see in Tab. 7 that ELECTRA outperforms RoBERTa in terms of all evaluation metrics by a margin. One promising future research line might be exploring which properties of these two models lead to the difference in their effectiveness for this task.
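Treating "the whole history as the user request", as described above, amounts to serializing the dialogue into a single string before it reaches the fine-tuned model. A minimal sketch, assuming a simple separator token (the `[SEP]` delimiter and function name are our assumptions, not necessarily what the released code uses):

```python
def context_as_request(turns, sep=" [SEP] ") -> str:
    """Serialize the full dialogue history (request, clarifying questions,
    user answers, in order) into one 'user request' string for the
    multi-turn fine-tuning setup."""
    return sep.join(turns)

req = context_as_request([
    "Tell me about the source of the Nile",
    "Are you referring to the Nile river?",
    "No, the board game called the source of the Nile",
])
```

The same single-turn model architecture can then be reused unchanged; only the input string grows with each round of question-answer interaction.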
Human Evaluation
To ensure that our automatic evaluation reflects the dialogues' quality, we conduct a pairwise human evaluation of two of the baselines. We use and extend the Acute-eval human annotation framework (Li et al., 2019a) to evaluate 120 randomly sampled dialogue pairs. For consistency, we use the four questions that the authors suggest, measuring Humanness (HU), Engagingness (EG), Interestingness (IN), and Knowledgeability (KL), and add a fifth one, specific to our task, on Clarification (CL). We modify the crowdsourcing task to inform the annotators about the conversation's main goal (i.e., information seeking). Furthermore, it is crucial to ensure that the annotators consider the model's ability to understand the user's feedback and incorporate the additional knowledge when asking its next question. Therefore, we added another question to examine this aspect of the conversation. As shown in Fig. 5, after seeing two full conversations, the annotators evaluated the models' ability of clarification by answering the following question: 'Which one asks better (or more reasonable) clarifying questions?'. Tab. 8 reports the results of our human evaluation of 120 dialogue pairs in terms of the percentage of cases in which ELECTRA beats RoBERTa based on the human annotation. We see that ELECTRA is judged to be the better model in most cases for all five aspects. It is interesting to see that the human annotation is in line with the proposed automatic evaluation, suggesting that our approach approximates the true quality of the models. Based on the reported results for automatic evaluation (Tab. 7), which are aligned with the human-in-the-loop evaluation (Tab. 8), we can conclude that the suggested methods are solid baselines for follow-up research, answering RQ2.b.
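The pairwise win rates in such an Acute-eval comparison can be summarized with a simple win rate plus an exact sign test against the 50/50 null. The sketch below is purely illustrative: the counts are made up, not the paper's reported results, and the function name is ours.

```python
from math import comb

def win_rate_and_p(wins: int, total: int):
    """Win rate of model A over model B across annotated pairs, with a
    two-sided exact sign test (null: each pair is a fair coin flip)."""
    rate = wins / total
    k = max(wins, total - wins)
    # Two-sided tail: 2 * P(X >= k) for X ~ Binomial(total, 0.5).
    p = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** (total - 1)
    return rate, min(p, 1.0)

# Hypothetical counts: A wins 75 of 120 pairwise comparisons.
rate, p = win_rate_and_p(wins=75, total=120)
```

With 120 pairs, a win rate around 60-65% is already enough to reject the fair-coin null, which is why 120 sampled dialogue pairs suffice for a pairwise judgment of this kind.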

Conclusions
Asking clarifying questions when a user request is ambiguous is essential for developing human-like open-domain dialogue systems. In this work, we introduce a large-scale dataset that covers almost 300 different topics and is suitable as a foundation for further research. The collected dataset includes both single- and multi-turn conversations. We benchmark several state-of-the-art neural models fine-tuned for asking clarifying questions. Based on how these models performed, we conclude that they are solid baselines for future research that more fully explores the problem space. In this paper, we also suggest an offline automatic evaluation pipeline, which agrees with human-in-the-loop evaluation.
We publicly release the collected datasets, the code for training baselines, and our evaluation procedure in order to push forward the state-of-the-art.