CDialog: A Multi-turn Covid-19 Conversation Dataset for Entity-Aware Dialog Generation

The development of conversational agents that interact with patients and deliver clinical advice has attracted the interest of many researchers, particularly in light of the COVID-19 pandemic. The training of end-to-end neural dialog systems, however, is hampered by the lack of multi-turn medical dialog corpora. We make a first attempt to release a high-quality multi-turn medical dialog dataset related to the Covid-19 disease, named CDialog, with over 1K conversations collected from online medical counselling websites. We annotate each utterance of a conversation with seven categories of medical entities, namely diseases, symptoms, medical tests, medical history, remedies, medications and other aspects, as additional labels. Finally, we propose a novel neural medical dialog system based on the CDialog dataset to advance future research on automated medical dialog systems. We use pre-trained language models for dialog generation, incorporating the annotated medical entities, to generate a virtual doctor's response that addresses the patient's query. Experimental results show that the proposed dialog models perform better when supplemented with entity information and hence improve the response quality.


Introduction
Telemedicine is currently highly appropriate for reducing the risk of COVID-19 among healthcare providers and patients, as the diversion of medical resources has caused millions of people around the world to experience delays in diagnosis and treatment. Conversational agents (Gopalakrishnan et al., 2019; Zhao et al., 2020; Reddy et al., 2019) have proved effective at carrying on a natural conversation and understanding the meanings of words to respond with coherent dialog. They have also been effective at supporting tasks such as booking a ticket (Liao et al., 2020), making reservations, etc. In the medical domain, standard techniques for modeling medical dialogs (Li et al., 2021; Xu et al., 2019) reduce face-to-face consultations, resulting in reduced costs, and help patients get quicker medical treatment. However, medical dialog systems are more difficult to implement than standard task-oriented dialog systems (TDSs), as many professional phrases and formal medical expressions are frequently used in such communication.

* Equal Contribution
A significant effort has recently been undertaken to collect medical dialog data for research on medical dialog systems (Liao et al., 2020). All of these datasets, however, have some limitations: (i) a comprehensive diagnosis and treatment procedure is lacking; (ii) the labels are not fine-grained enough, as prior research has typically provided a single coarse label for the entire utterance, which may mislead model training and/or lead to erroneous assessment, and the scale of the medical entities involved is limited; (iii) the dialog length is limited to an average of only 2 turns. As can be seen in Figure 1, the original CovidDialog corpus has dialogs with only one turn, and the patient's and doctor's utterances are also lengthy, with all the information packed together in one place. We split such dialogs to make them more suitable for dialog settings by separating and pairing the doctors' and patients' utterances at appropriate points. For example, the first sentence of the patient's query (c.f. Q in Figure 1) is chosen as the first utterance (c.f. X1) of the multi-turn dialog shown on the right. To maintain the dialog flow, we include generic doctor utterances, such as "Yes sure, please state your concern." (c.f. X2), as the second utterance.

Q: I am a 23-year-old man. I have anxiety and depression but no immunodeficiency disorders or chronic diseases. First, I wanna know if my immune system is weakened and how likely am I to die of Coronavirus. Second, I have itchiness in my throat and shortness of breath. I always have it because of anxiety but now it is more persistent than before. I also feel a very subtle feeling of pain, burning sensation and itchiness in my chest. I did not go out during the past ten days and have not been in contact with someone with positive Covid. And I quit smoking past week. Down to one or zero cigarettes from 20 a day. Why am I feeling itchiness and pain and burning sensation? How likely am I to have Covid-19? And how likely am I to die from it?

A: Hello. Anxiety can manifest itself in physical or psychological symptoms or both. The irritation sensation you are experiencing is a part of your anxiety. Also please do not believe the hype about covid. It has a low mortality rate, of 2-3 percent, that too mortality is very high in people above 60-70 years with other co-morbidities. You do not have any such disorders and you have not even been in contact with anyone, so do not worry. Please continue to take your medicines for anxiety if you are taking them, if not, please consider visiting a psychiatrist and get started on some low dose SSRI type of medicines. Also, if needed a low dose Benzodiazepines can be added temporarily. Also, please do some deep breathing exercises or progressive muscle relaxation. You can also take some honey with water to reduce the itching in the throat and also try doing some gargles with lukewarm saltwater.

Patient (X1): I am a 23-year-old man and I have some queries regarding coronavirus. Can you help me?
Doctor (X2): Yes sure, please state your concern.
Patient (X3): I have anxiety and depression but no immunodeficiency disorders or chronic diseases. First, I wanna know if my immune system is weakened and how likely am I to die of Coronavirus.
Doctor (X4): Anxiety can manifest itself in physical or psychological symptoms or both. Also please do not believe the hype about covid. It has a low mortality rate, of 2-3 percent, that too mortality is very high in people above 60-70 years with other co-morbidities.
Patient: Thank you doctor, my second concern is that I have itchiness in my throat and shortness of breath. I always have it because of anxiety but now it is more persistent than before.
Doctor: Do you also have any other medical concerns like pain in the body?
Patient: I also feel a very subtle feeling of pain, burning sensation and itchiness in my chest.
Doctor: Did you have any travel history or have you been in contact with any foreigner?
Patient: I did not go out during the past ten days and have not been in contact with someone with positive Covid.
Doctor: Do you smoke or drink?
Patient: I quit smoking past week. Down to one or zero cigarettes from 20 a day.
Doctor: Is there anything else you wanna tell?
Patient: Why am I feeling itchiness and pain and burning sensation? How likely am I to have Covid-19? And how likely am I to die from it?
Doctor: You do not have any such disorders and you have not even been in contact with anyone, so do not worry. Please continue to take your medicines for anxiety if you are taking them, if not, please consider visiting a psychiatrist and get started on some low dose SSRI type of medicines. Also, if needed a low dose Benzodiazepines can be added temporarily.
Patient: Can you recommend some exercise and home remedies?
Doctor: Please do some deep breathing exercises, or progressive muscle relaxation. You can also take some honey with water to reduce the itching in the throat and also try doing some gargles with lukewarm salt water.

Figure 1: Sample conversation from the CDialog dataset. The sample on the left is from the existing CovidDialog dataset; we have extended it to a multi-turn dialog with eight turns along with entity information, shown on the right.

We also include appropriate sentences from the doctor's response (c.f. A) as subsequent utterances (c.f. X4) that respond to the patient's utterance (c.f. X3) at that point. Further, we assign fine-grained, medically relevant categories to these utterances. For example, the third utterance in Figure 1 carries two different kinds of categories: informing symptom status (Symptoms: anxiety, depression) and inquiring about diseases (Disease: Coronavirus). To address the lack of medically relevant dialog data, we create CDialog, a multi-turn medical dialog dataset pertaining to the Covid-19 disease. As indicated in Table 1, our dataset has the following advantages over existing conversational datasets. First, our dataset is the largest Covid-19-related dialog dataset with the highest average number of dialog turns, and is thus more suitable for training neural conversation models. Second, CDialog is informative and diversified, with 12 types of diseases and 253 types of entities, which is far more representative of an actual medical consultation scenario. Furthermore, to gain a better grasp of the response generation task, we compare a number of cutting-edge models on CDialog using popular pre-trained language models such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019). Moreover, we create a medical entity-aware dialog system that makes use of entity-level knowledge. According to the experimental results, combining entity information with the dialog history in the generation process improves response quality.
Our current work makes the following contributions:
1. We build and release CDialog, a multi-turn medical dialog dataset related to Covid-19. CDialog has around 1K conversations with more than 7K utterances annotated with seven types of medical entities, providing a credible standard for evaluating the medical consultation capabilities of dialog systems.
2. On the CDialog dataset, we present several baselines for response generation and propose techniques for utilizing the relevant medical entities in a medical dialog system.
3. We conduct rigorous experiments, including quantitative and qualitative evaluation, of a number of cutting-edge pre-trained models for medical dialog generation. Empirical evaluation demonstrates that using annotated entities as auxiliary information significantly improves response quality.

Related Work
Sequence-to-sequence models (Vinyals and Le, 2015; Sutskever et al., 2014) have been widely adopted for dialog generation. Recent work by Zhang et al. (2020a) using pre-trained language models has demonstrated compelling performance at generating responses that make sense in the conversation context while also carrying specific content to keep the conversation going, by fine-tuning GPT-2 (Radford et al., 2019) in different sizes on social media data. Among the accessible pre-trained language models, BERT is commonly utilised in the medical domain, as several models, such as BioBERT (Lee et al., 2020) and Clinical-BERT (Alsentzer et al., 2019), are trained on data from this specific domain.
Information extraction (Zhang et al., 2020b), relation prediction (Du et al., 2019; Lin et al., 2019; Xia et al., 2021), and slot filling are some of the recent tasks performed on medical data. In the medical domain, the use of reinforcement learning frameworks in dialog systems has encouraged the learning of dialog management strategies. Further, Xu et al. (2019) increased the rationality of medical conversational decision-making by incorporating external probabilistic symptoms into a reinforcement learning framework. Liao et al. (2020) and Xia et al. (2020) used hierarchical reinforcement learning for automatic disease diagnosis. These RL systems, however, learn solely from tabular data recording the existence of symptoms, ignoring other key information such as symptom features, tests, and treatment. Furthermore, early end-to-end medical dialog systems were constructed by Ferguson et al. (2009), Wong et al. (2011), Gatius and Namsrai (2012), and Liu et al. (2016a). MedDialog, a high-quality unlabelled large-scale medical dialog dataset in Chinese and English covering more than 50 diseases, has also been released; although the MedDialog corpora contain the highest number of dialogs, they do not cover dialogs on Covid-19 and have an average dialog length of only 2. Another released general-domain medical dialog corpus contains 2K labelled and 100K unlabelled samples, but in the form of individual utterances rather than entire dialogs. MedDG, compared to the previous corpora, involves more diseases, entities, dialogs, and utterances to alleviate the issue of data scarcity. Li et al. (2021) also released a high-quality knowledge-aware medical conversation dataset (KaMed) from ChunyuDoctor, a large online Chinese medical consultation platform. Similar to previous datasets, Li et al. (2021) did not focus on the Covid-19 disease.
We create and release a multi-turn dialog dataset named CDialog, which contains 1K English consultations between patients and doctors, with utterances annotated with medical entities. Finally, we propose an entity-aware neural medical conversation model that generates appropriate responses by utilizing the annotated entities.

Resource Creation
In this section, we describe the details of resource creation.

CDialog Dataset
We extend the CovidDialog dataset with dialogs about the conditions that occur as symptoms of Covid-19 and name the result Ext-CovidDialog, which contains approximately 10K dialogs. The motivation for extending the dataset is that a conversation about Covid-19 can benefit from conversations about fever, cough, cold, and other symptoms of Covid-19. We used online health consultation platforms such as icliniq.com and healthcaremagic.com to crawl data for fever, cough, etc. We then extended the dialog length of 1K dialogs (from 2 to 8 turns) using the dialogs from Ext-CovidDialog (which contains ~10K dialogs) and annotated them with several medical entities. The resulting dataset, named CDialog, is the dataset we propose in this work.
Our motivation lies within the scope of building a conversational system that engages in online conversation with users. When developing an automated conversational system, generating longer responses is often a problem for deep learning models. Hence, we have manually broken these longer utterances into multiple turns. We interacted with the medical experts at our university hospital to ensure that such splitting does not distort crucial health-related information; rather, we added generic statements in order to maintain the flow of the conversation.

Construction Details
Figure 1 shows a sample of a created and annotated conversation from the CDialog dataset. The average number of utterances in the crawled data (Ext-CovidDialog) is 2.0 per conversation, and the average number of tokens in an utterance is 103. Such a conversation is therefore more akin to a question-and-answer session, with the patient describing their problem in detail and the doctor thoroughly answering each question. We aim to convert this question-and-answer setup (c.f. Figure 1, left) into a multi-turn, human-like conversation format (c.f. Figure 1, right). For this, we first view the patient query (c.f. Q in Figure 1) as a combination of individual sentences such that each sentence represents some meaningful intent. Then, we choose an appropriate sentence to start the conversation. For each chosen sentence from the patient's query, we search for its corresponding response in the doctor's answer (c.f. A in Figure 1). We introduce or modify dialog turns in between as needed to ensure that all dialogs read continuously and do not go out of context. Because medical data annotation requires annotators with proficient medical knowledge, the annotation cost is high. We employ four annotators with relevant medical expertise. Before beginning the annotation process, we explained the annotation guidelines (c.f. Appendix B) to the annotators using a few examples from the dataset. We observe a Fleiss' kappa (Fleiss, 1971) score of 0.85 among the annotators, denoting good agreement on the task of converting single-turn dialogs into multi-turn dialogs.
Medical Entity Annotation: After consulting with domain experts, we choose the following seven kinds of entities for annotation: Diseases, such as allergic conjunctivitis, allergic cough, bacterial conjunctivitis, and so forth; Symptoms, such as pneumonia, body ache, cough, and so on; Medication, such as anti-allergic tablets, betadine gargle solution, hydroxychloroquine, and so on; Medical Tests, such as x-rays, etc.; Medical history, which may be "clinical" or "non-clinical"; Remedies, such as gargling, exercise, and so on; and Other aspects, such as age, nature of pain, duration, and location. In total, we have 253 entities: 25 different medical tests, 87 different symptoms, 138 different medications, 12 different diseases, 2 different medical histories, 10 unique remedies, and 4 other aspects. The distribution of entities in the CDialog dataset is depicted in Figure 2, which shows the proportion of entities in each of the seven categories. Each utterance of a conversation is labeled separately with the seven entity categories, as shown on the right side of Figure 1. The annotation process involved four annotators with relevant medical backgrounds. They began by discussing the creation of an annotation template. Each annotator then annotated a small portion of the data and reported any confusing utterances; we summarized these observations and revised the annotations once more. We observe a Fleiss' kappa (Fleiss, 1971) score of 0.89 between annotators, denoting very good agreement on the entity annotation task.
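As an illustration of this annotation scheme, a single annotated utterance could be stored as a simple record along the following lines. This is only a sketch: the field names and serialization are hypothetical, not the dataset's actual on-disk format; the seven categories follow the paper.

```python
# Hypothetical serialization of one annotated CDialog utterance.
# Field names ("speaker", "text", "entities") are assumptions for
# illustration; the seven entity categories are from the paper.
utterance = {
    "speaker": "patient",
    "text": "I have anxiety and depression but no immunodeficiency "
            "disorders or chronic diseases.",
    "entities": {
        "Disease": [],
        "Symptoms": ["anxiety", "depression"],
        "Medication": [],
        "Medical Tests": [],
        "Medical history": ["clinical"],
        "Remedies": [],
        "Other aspects": [],
    },
}

# Flatten the non-empty categories into one entity list for a model input.
flat = [e for cat in utterance["entities"].values() for e in cat]
print(flat)
```

Flattening the per-category labels into one entity list is how such annotations would feed the concatenation-based entity-aware models described later.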
More details on the platform and annotators payment can be found in the Appendix B.

Dataset Statistics and Comparison to Existing Datasets
As a result of the annotation process described in Section 3.1.1, the CDialog dataset contains 1012 English consultations about Covid-19 and Covid-related symptoms, such as allergic conjunctivitis, allergic cough, bacterial conjunctivitis, and so forth, which aids in building the multi-turn dialog generation model. The total number of tokens is 1,085,204, and the total number of utterances is 7,982. The average, maximum, and minimum numbers of utterances per dialog are 8.0, 48, and 2, respectively. The average, maximum, and minimum numbers of tokens in an utterance are 136, 5313, and 2, respectively. The dataset statistics are shown in Table 5 in Appendix A. We compare our proposed CDialog dataset to other publicly available datasets in Table 1 and observe that only three of the available datasets mentioned in Section 2 are in English. Compared to these datasets, the average dialog length in CDialog is eight, indicating that it is more conversational in nature, and our dataset is the largest focusing solely on Covid-19, with entity annotation for developing entity-aware language models.
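Corpus statistics of the kind reported above (utterances per dialog; average, maximum, minimum) reduce to simple aggregation over the dialog list. A minimal sketch, using a tiny toy corpus in place of the real data:

```python
# Sketch: per-dialog turn statistics like those reported for CDialog.
# The toy corpus below is illustrative; each dialog is a list of utterances.
dialogs = [
    ["u1", "u2", "u3", "u4"],
    ["u1", "u2"],
]

turns = [len(d) for d in dialogs]
avg_turns = sum(turns) / len(turns)
print(avg_turns, max(turns), min(turns))  # → 3.0 4 2
```

The same pattern over token counts per utterance yields the reported token statistics.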

Task Definition
The goal of a medical dialog system is to provide context-consistent and medically relevant responses based on the conversation history. Formally, we are given the history of a conversation between doctor and patient comprising K utterances, X = {X_1, X_2, ..., X_K}, where each X_i is either a doctor's or a patient's utterance. Each utterance X_i is tagged with an entity set E_i = {e^i_1, ..., e^i_s}, where s is the number of entities associated with X_i, and E = E_1 ∪ ... ∪ E_K. The response generation task is to generate a response Y = (y_1, y_2, ..., y_M) with M words, given the K previous utterances and the entity set E. The architecture is shown in Figure 3.
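Under this formulation the response likelihood factorizes autoregressively, P(Y | X, E) = Π_j P(y_j | y_1, ..., y_{j-1}, X, E). A minimal sketch of that factorization, with a stand-in for the model's per-step distributions (the `step_probs` structure is an assumption for illustration, not part of the paper's implementation):

```python
import math

def sequence_log_prob(step_probs, response):
    """log P(Y | X, E) as a sum of per-step log-probabilities.

    step_probs[j] maps candidate words to P(word | y_<j, X, E);
    here it stands in for a trained model's decoder output.
    """
    return sum(math.log(step_probs[j][w]) for j, w in enumerate(response))

# Two decoding steps over a toy vocabulary.
probs = [{"take": 0.5, "rest": 0.5}, {"rest": 0.8, "care": 0.2}]
lp = sequence_log_prob(probs, ["take", "rest"])
print(round(lp, 4))  # log(0.5) + log(0.8)
```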

Entity-aware Dialog Model
Since plain generative models cannot make use of our dataset's annotated entity labels, we present entity-aware models that exploit this supplementary entity knowledge. In this method, the entity set is directly concatenated after the dialog history as new input text and then used to encourage the models to generate relevant responses.
The input sequence is formed by concatenating the dialog history and the entity set, where the [CLS] token is inserted at the start of the sequence to indicate the beginning, and the [SEP] token denotes the end of a sentence and separates one sequence from the next. Each token is first embedded through three layers (Token, Segment, and Position). The hidden states are obtained by feeding the vectors from these three embedding layers into the BioBERT encoder; the hidden vector for the i-th word of the input utterance is denoted H^{k-1}_i. The bidirectional nature of BioBERT ensures joint conditioning on both the left and right contexts of a token. Then, using a BioBERT decoder, we generate the doctor's response, Y = (y_1, y_2, ..., y_M), teacher-forced with the words of the gold response X^k = (x^k_1, x^k_2, ..., x^k_{|X^k|}). The decoder predicts each word y_j conditioned on x^k_1, ..., x^k_{j-1}.
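The entity-concatenation step described above can be sketched as plain string assembly (tokenizer specifics omitted; the delimiter placement follows the description, but the exact layout used by the authors may differ):

```python
def build_input(history, entities):
    """Concatenate the dialog history and the annotated entity set into
    a flat input text with [CLS]/[SEP] delimiters, entities appended
    after the history as described in the text."""
    parts = ["[CLS]"]
    for utt in history:
        parts += [utt, "[SEP]"]
    # Entity knowledge is appended after the history as extra input text.
    parts += [", ".join(entities), "[SEP]"]
    return " ".join(parts)

inp = build_input(
    ["i have a sore throat and fever", "since when do you have fever?"],
    ["sore throat", "fever"],
)
print(inp)
```

The resulting string is what a BioBERT-style tokenizer would then split into subwords before the three embedding layers.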

Training Loss
The decoder loss is the cross-entropy between the output distribution P(y^k_j) and the reference distribution T_j, denoted as

Loss = − Σ_j T_j log(P(y^k_j))    (4)
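Written out for a single decoding step over a small vocabulary, Eq. (4) with a one-hot reference distribution reduces to the negative log-probability of the reference word (a pure-Python sketch, not the training code):

```python
import math

def cross_entropy(ref_onehot, pred_probs):
    """Loss = -sum_j T_j * log P(y_j), as in Eq. (4).

    ref_onehot is the reference distribution T (one-hot over the
    vocabulary); pred_probs is the model's output distribution P.
    """
    return -sum(t * math.log(p)
                for t, p in zip(ref_onehot, pred_probs) if t > 0)

# Reference word is index 2; the model assigns it probability 0.5.
loss = cross_entropy([0, 0, 1, 0], [0.1, 0.2, 0.5, 0.2])
print(round(loss, 4))  # -log(0.5)
```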

Experimental Setup
This section describes the baseline models and evaluation metrics. Implementation details can be found in the Appendix C.

Baselines
We use the following baseline models: 1. GPT-2 (Radford et al., 2019): a Transformer-based language model pre-trained on a large corpus of web text, in which the input sequence is passed through the model to generate a conditional probability over the output sequences.
2. DialoGPT finetune (Zhang et al., 2020a): This model is based on the OpenAI GPT-2 architecture and was trained on 147 million Reddit conversations. We first concatenate all dialog turns within a dialog session into one long text terminated by the end-of-text token.
3. BERT (Devlin et al., 2018): This model uses the Transformer attention mechanism, which learns contextual relations between the words (or sub-words) in a text. BERT is used as an encoder to encode the input and as a decoder to generate the relevant output.
4. BART (Lewis et al., 2019): In this model, a bidirectional encoder is used for encoding the input sequences, and the appropriate response is generated using a left-to-right decoder.
5. BioBERT (Lee et al., 2020): BioBERT is a model similar to BERT, except that it has been pre-trained on a large biomedical corpus. It outperformed BERT and other state-of-the-art models on several biomedical text analysis tasks. We use BioBERT as both the encoder and the decoder.
In BERT-Entity, BART-Entity, and BioBERT-Entity, the entity set is directly concatenated after the dialog history as new input text and then used to encourage the models to produce relevant responses.

Automatic Evaluation
We evaluate our models on the test set using the following standard metrics. The BLEU (Papineni et al., 2002) score computes the amount of word overlap with the ground-truth response. ROUGE-L (Lin, 2004) measures the longest matching sequence of words between the candidate and the reference using the longest common subsequence method. Perplexity (PPL) indicates how well the system learns to model the dialog data. We also compute the unigram F1-score between the predicted sentences and the ground-truth sentences. Embedding-based metrics (Liu et al., 2016b), such as Greedy Matching, Vector Extrema, and Embedding Average, are an alternative to word-matching-based metrics: they assign a vector to each word in order to capture the intended sense of the predicted sentence, as described by the word embedding.
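The unigram F1 metric mentioned above can be sketched as clipped token overlap between prediction and reference; this is a common formulation, and the paper's exact evaluation script may differ in tokenization details:

```python
from collections import Counter

def unigram_f1(pred, ref):
    """Unigram F1 between a predicted and a reference sentence,
    using clipped token-count overlap (whitespace tokenization)."""
    pred_toks, ref_toks = pred.split(), ref.split()
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("please take rest and drink water",
                   "take rest and drink warm water")
print(round(score, 4))
```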

Human Evaluation
To evaluate the quality of the generated responses from a human point of view, we randomly select 50 dialogs from each model developed on the CDialog dataset and analyze the predicted responses with the assistance of three human evaluators. For each example, we provide the responses (generated by the models, and the human ground truth) to our annotators. The human raters are post-graduates in science and linguistics with annotation experience in text mining tasks. We also had our model outputs validated by a doctor with a postgraduate degree in medicine; the important medical information was found to be retained in the responses. To assess the accuracy of our model predictions, we employ the following metrics: (i) Fluency: a measure of a sentence's grammatical correctness. (ii) Adequacy: whether the generated response is meaningful and relevant to the conversation history. (iii) Entity Relevance (ER): whether or not a response contains the correct medical entities.
The scale runs from 1 to 5; higher is better. For the fluency metric, the ratings refer to incomprehensible, disfluent, non-native, good, and flawless English, respectively. Similarly, for the adequacy metric they correspond to none, little meaning, much meaning, most meaning, and all meaning, respectively. The ratings from the various annotators are averaged and shown in Table 3. We compute the Fleiss' kappa (Fleiss, 1971) score to measure inter-annotator agreement.
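Fleiss' kappa, used throughout for inter-annotator agreement, is straightforward to compute from a matrix of per-item category counts. A self-contained sketch (toy ratings, not the paper's data):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a count matrix: ratings[i][k] is the number of
    annotators who assigned item i to category k, with the same number
    of raters for every item."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    total = n_items * n_raters
    # Per-category marginal proportions and per-item observed agreement.
    p_cat = [sum(row[k] for row in ratings) / total for k in range(n_cats)]
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in ratings]
    p_bar = sum(p_item) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_cat)     # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Three raters, two categories; perfect agreement on both items.
print(fleiss_kappa([[3, 0], [0, 3]]))  # → 1.0
```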
Results and Analysis

Table 2 and Table 3 show the automatic and human evaluation results of the baselines and the proposed models. Table 2 reports the automatic evaluation metrics on the CDialog dataset. On most metrics, BioBERT-Entity outperforms the BERT-Entity and BART-Entity models, demonstrating the effectiveness of incorporating medical entities with biomedical embeddings as additional learning signals for medical dialog generation. Overall, we observe that entity-based models tend to perform better and capture the majority of the entities present in the dialog. On CDialog, BioBERT-Entity yields a significant performance improvement of around 12.25% in F1 score and 52.94% in BLEU-4 on the test set compared to the strongest baseline, BioBERT. (We performed a t-test (Lehmann and Romano, 2006) between the proposed BioBERT-Entity and the best baseline, BioBERT, and between BART and BERT with and without entities; in both settings the p-value was less than 0.001, indicating that the proposed methods significantly outperform the baselines.) Apart from word-overlap-based metrics, we also notice significant improvement in embedding-based metrics, indicating efficient decoding using relevant entity information. Comparisons to more baseline models can be found in Appendix D.1. Table 3 shows the results of the human evaluation. Entity-based models outperform the baseline models on fluency, adequacy, and medical entity relevance, consistent with the automatic evaluation results. All of the kappa values are greater than 0.75, indicating that the annotators are in good agreement.

Human Evaluation Results
In Table 4, we present a few example conversations as predicted by the entity based BioBERT-Entity, BART-Entity and vanilla BioBERT and BART models on the test set from CDialog corpus. As seen in the first example, BioBERT-Entity correctly decodes the response by utilising the context information and provides counselling to the patient. In the same example, we may note that BioBERT-Entity, as opposed to models without entity information like BioBERT, generates a more adequate response by utilizing the entity "diabetes".

Error Analysis
Using the generated responses, we conduct a thorough examination of our proposed model and classify the errors it encounters into the following categories: 1. Generic Response: We see cases of generic responses by the doctor, such as "would you like to video or text chat with me?", in about 4% of the predicted responses of all models, leading to reduced medical entity presence. The reason is that many samples in the training data contain such responses in order to maintain proper information flow and lead to a reasonable conversation. (Table 4 excerpts: BioBERT: "no problem, glad to be of help. be safe and avoid hand washing or dusty hands."; BART-Entity: "please call your doctor as soon as possible. if he develops some signs of covid-19, he should be examined and tested as soon as possible."; BART: "your welcome".)
2. Non-Fluency: We observe around 5% of cases of non-fluency, such as "if you were feeling ?" and "yes, we can think you give me?", mostly for the BERT and BART models. These models do not take medical entities into account and are not trained on biomedical data, which leads to inconsistent responses, as they miss important medical terms while predicting responses.
3. Inadequacy: The model sometimes fails to predict correct responses for patient utterances with a large set of context utterances. For example, in the second sample of Table 4, the conversation history comprises more than six utterances; the model fails to keep track of the previous information and hence generates an inadequate or generic response.
4. Incorrect entity prediction: In around 10% of cases, the model predicts irrelevant medical entities, resulting in contextually incorrect responses. For example, Patient: "i am experiencing nasal congestion, sneezing (unaffected by: recent exposure to allergens, exposure to secondhand smoke), sore throat, itchy eyes, ear pressure, nasal drainage, post nasal drip, eye irritation, runny nose, and watery eyes"; Doctor: "i think it is itching/congestion. with the itching could be seasonal allergies would consider benadryl 1/2 to 1 tab at bedtime and zyrtec during the day. itching is pretty specific for allergies?"; Predicted Response: "hi, also called urti-allergy. have you taken any medicines?". As can be seen, the predicted response missed all of the entities mentioned in the patient's utterance. A likely reason is that, because many entities were mentioned in the utterance, the model was confused and produced "urti-allergy", which is nevertheless close to the mentioned symptoms.
More details on the performance of baseline models on these errors can be found in Appendix E.

Conclusion
In this paper, we have created an enriched multi-turn medical dialog corpus with manually labeled medical entities. The dataset is constructed for the purpose of developing an efficient medical dialog system, with an average dialog length of 8. To facilitate effective conversation understanding and generation, we propose an entity-aware neural conversational model for medical dialog generation. The evaluation results on two benchmark medical datasets show that a BERT-based model with biomedical embeddings and relevant medical entities can successfully generate correct and informative responses.
In the future, we aim to use a medical knowledge graph generated from the UMLS database to provide domain knowledge for medical dialogs and to model the relationships between different medical entities. The code and dataset used to replicate our findings are available at https://github.com/deekshaVarshney/CDialog and https://www.iitp.ac.in/~ai-nlpml/resources.html.

Ethical Declaration
All of the datasets used in this study are freely available to the public and were collected from public websites. We followed the policies for using those data and did not violate any copyright. The dataset used in this paper is solely for academic research purposes. In a real-world application, medical dialog systems could be used to counsel patients and collect data for diagnosis; even if the agent makes a few minor mistakes during the process, doctors will eventually take over in the end. Annotation was done by a dedicated team of people who work full-time. The dataset was medically verified by the health department of our institute. We do not disturb any health-related information and only add generic statements in order to maintain the flow of the conversation. We further had the data collection and annotation process reviewed by our university review board.

Limitations
Detailed cases of the limitations of our model are described in Section 6.3. Modelling medical entities is a challenging task in dialog generation, and we aim to investigate it further in the future.
Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654-664.
Yufan Zhao, Wei Wu, and Can Xu. 2020. Are pre-trained language models knowledgeable to ground open domain dialogues? arXiv preprint arXiv:2011.09708.

A Dataset Statistics and Additional Experiments

Table 5 presents the dataset statistics for the proposed CDialog dataset. The dataset is split in an 80:10:10 ratio into training, test and validation sets. We conduct several experiments to show the effectiveness of the entity annotation, described as follows. Since we have broken the longer utterances into shorter ones, having extra information in the form of entity annotations is clearly useful. This is demonstrated by our experiments in Table 2, where we build models both with and without entities; the results clearly show improved performance for the models with entities. We conduct an additional experiment on the Ext-CovidDialog dataset and observe that the entities yield no improvement there, showing that entity annotation is more useful for shorter utterances. Results on Ext-CovidDialog: BioBERT, F1-score: 0.222; BioBERT + Entity, F1-score: 0.211.
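The 80:10:10 split described above can be sketched as follows (the function and variable names are illustrative, not taken from the released code, and the actual split procedure may differ):

```python
import random

def split_dialogs(dialogs, seed=42):
    """Shuffle conversations and split them 80:10:10 into
    train / test / validation sets (a sketch; the released
    code may use a different procedure or seed)."""
    dialogs = list(dialogs)
    random.Random(seed).shuffle(dialogs)
    n = len(dialogs)
    n_train = int(0.8 * n)
    n_test = int(0.1 * n)
    train = dialogs[:n_train]
    test = dialogs[n_train:n_train + n_test]
    valid = dialogs[n_train + n_test:]
    return train, test, valid

# Example: 1,000 conversations -> 800 / 100 / 100
train, test, valid = split_dialogs(range(1000))
```

Splitting at the conversation level (rather than the utterance level) keeps all turns of a dialog in the same partition, which avoids leakage between training and evaluation.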

B Annotation Details
Annotation Guideline: Given a query from a patient and an answer from a doctor, the task is to convert the pair into a multi-turn dialog by selecting sentences from the query-answer pair such that they form a sensible multi-turn conversation. Each turn in the conversation contains an utterance by the patient and a response by the doctor. Figure 4 shows an overview of the pipeline for creating the multi-turn dialog data.
1. For each sample query-answer pair, we employ two annotators: one who produces utterances for the patient, and one who acts as the doctor and selects relevant sentences as responses. This configuration has several advantages over using a single annotator to serve as both patient and doctor: when two annotators chat about a passage, the dialogue flows naturally, and when one annotator gives a vague response, the other can raise a flag, which we use to identify bad workers.
2. Both the acting patient and the acting doctor see the original query and answer, as well as the conversation so far, i.e., the utterances and responses from previous turns.
3. To frame a new utterance that starts the conversation, annotators read the longer query, typically pick its first sentence as the utterance, and modify it accordingly. For example, as shown in Figure 1, the annotator picks the sentence "I am a 23-year-old man" from Q and adds "and I have some queries regarding coronavirus. Can you help me?" to begin the conversation.
4. While responding, we want the annotator to look into the longer answer (c.f. A in Figure 1) and pick the appropriate sentence as the doctor's utterance. We further ask them to sometimes respond with only generic sentences, such as "Is there anything else you wanna tell?" (c.f. X2), to generate a natural conversation.
5. For medical entity annotation, seven empty columns are provided in which to enter the relevant medical terms for the different categories defined in Section 3.1.1. For example, in Figure 1, for utterance X4, the relevant medical entities to be annotated are Symptom: Anxiety; Disease: Covid-19. The annotators were also asked to remove any names to anonymize the data.
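One annotated utterance can be represented as a record carrying the seven entity categories. The field names and helper below are illustrative, not taken from the released dataset files:

```python
# Sketch of one annotated utterance; field names are assumed,
# not taken from the released CDialog files.
ENTITY_CATEGORIES = [
    "disease", "symptom", "medical_test",
    "medical_history", "remedy", "medication", "other",
]

def make_utterance(speaker, text, **entities):
    """Build an annotated utterance, filling unmentioned
    entity categories with empty lists."""
    unknown = set(entities) - set(ENTITY_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown entity categories: {unknown}")
    return {
        "speaker": speaker,
        "text": text,
        "entities": {c: entities.get(c, []) for c in ENTITY_CATEGORIES},
    }

# Mirrors the X4 example above: Symptom: Anxiety; Disease: Covid-19
utt = make_utterance(
    "patient",
    "I have been feeling anxious since I tested positive.",
    symptom=["anxiety"], disease=["covid-19"],
)
```

Keeping every category present, even when empty, gives downstream models a fixed-width entity slot per utterance.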
Annotator details: The annotators are regular employees, paid monthly as per university norms at the rate of 35k/month. They have been employed in our research group and have been working on similar projects for the last three years.

[Figure 4 pipeline (residual text): read the whole Q-A thread, then choose an appropriate sentence from the patient's query (Q) to start the conversation.]

C Implementation Details
All of the experiments are implemented using the PyTorch framework. BART and BioBERT have a hidden size of 1024, while BERT has a hidden size of 512. The number of layers is set to 2, 12 and 6 for the BERT, BART and BioBERT models, respectively, and the corresponding numbers of parameters are 96,764,928, 457,762,816 and 360,749,056. We use grid search to obtain the optimal hyperparameter values. We use the AdamW optimizer with the learning rate fixed to 0.0005, and the beam size is set to 1 while decoding the responses. We choose the best model when the loss on the validation set does not decrease further. We use a GeForce GTX 1080 Ti as the computing infrastructure. Each model is trained for up to 30 epochs. After three runs with different random seeds for each method, the variances of the results are at most 1e-4 and have no impact on the trend.

D.1 Automatic Evaluation
We also compare our proposed approaches with LSTM-based state-of-the-art models such as Seq2Seq (Vinyals and Le, 2015), HRED (Serban et al., 2015) and VHRED (Serban et al., 2017). Seq2Seq obtains an F1-score of 5.20 and a BLEU-4 score of 0.001 on the test set of our proposed CDialog dataset. HRED obtains an F1-score of 5.67 and a BLEU-4 score of 0.003, with embedding average, extrema and greedy scores of 0.611, 0.302 and 0.542, respectively. VHRED obtains an F1-score of 6.11 and a BLEU-4 score of 0.003, with embedding average, extrema and greedy scores of 0.621, 0.304 and 0.552, respectively.
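The token-level F1 reported above is commonly computed as the harmonic mean of precision and recall over the overlapping tokens of a generated response and its reference. The sketch below uses this standard formulation; the paper's exact tokenization and averaging may differ:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a generated and a reference
    response, counting overlapping tokens with multiplicity."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token occurrence once
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("please take rest and drink fluids",
                 "take rest and drink plenty of fluids")
```

A corpus-level score is then an average of these per-response values over the test set.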

E Error Analysis
Performance of the baseline models on inadequacy and incorrect entity prediction. 1. Inadequacy: The predictions of the baseline BART and BioBERT models are shown in Table 4. As can be seen, the baseline models struggle to keep track of information, resulting in insufficient or generic responses.
2. Incorrect entity prediction: For the example shown in Section 6.3, point 4, the baseline models predict as follows. BERT: "have you been recently? please send for any more information. i have read your query in detail." BART: "do you have family history?" BioBERT: "not allergy. if you have already taken antibiotics, it may help. did you have any other contact with a doctor?" It can be noted that the baseline models perform even worse than the models with entities in terms of retaining relevant clinical information in the predicted response.