MidMed: Towards Mixed-Type Dialogues for Medical Consultation

Most medical dialogue systems assume that patients have clear goals (seeking a diagnosis, medicine querying, etc.) before medical consultation. However, in many real situations, due to a lack of medical knowledge, it is usually difficult for patients to determine clear goals with all necessary slots. In this paper, we identify this challenge as how to construct medical consultation dialogue systems that help patients clarify their goals. For further study, we create a novel human-to-human mixed-type medical consultation dialogue corpus, termed MidMed, covering five dialogue types: task-oriented dialogue for diagnosis, knowledge-grounded dialogue, QA, recommendation, and chitchat. MidMed covers four departments (otorhinolaryngology, ophthalmology, skin, and digestive system), with 8,175 dialogues. Furthermore, we build benchmarking baselines on MidMed and propose an instruction-guiding medical dialogue generation framework, termed InsMed, to handle mixed-type dialogues. Experimental results show the effectiveness of InsMed.


Introduction
Current medical dialogue systems (Xu et al., 2019; Liao et al., 2020; Zeng et al., 2020; Liu et al., 2022a) mainly focus on diagnosis by obtaining symptoms and then making diagnoses automatically. These dialogue systems have shown significant potential and alluring technological value for simplifying diagnostic procedures (Semigran et al., 2015). Previous works assume that patients have explicit goals (medicine querying, surgical operation querying, etc.), and operate in the manner of task-oriented dialogue to accomplish patients' goals.
However, explicit patient goals are usually unavailable in real-world scenarios. For example, a patient may want to consult about his itchy skin but lack medical knowledge. Thus, it is difficult for the patient to decide which slots (e.g., medicine or a surgical operation) are needed. To figure out explicit patient goals, medical consultation services are needed, which provide advice on treatment, medicine, food, etc., as shown in Figure 1. However, such medical consultation services are underexplored in previous works.
To facilitate the study of medical consultation, we construct a new human-to-human mixed-type dialogue dataset for medical consultation (MidMed), covering five dialogue types: task-oriented dialogue for diagnosis, knowledge-grounded dialogue, QA, recommendation, and chitchat. MidMed is constructed by revising dialogues of MedDialog (a human-to-human medical diagnosis dialogue dataset) (Zeng et al., 2020). As shown in Figure 1, a patient queries about "sweaty hands" and has no explicit goal for medicine or a surgical operation. In this scenario, the doctor first collects the symptoms and makes a diagnosis. To help clarify the patient's goal, the doctor further recommends medicine and food, replies about foods to avoid, and gives emotional comfort. Through the consultation, the patient decides to apply "dexamethasone cream" and eat more "tomatoes". Finally, MidMed is obtained, containing 8,175 dialogues and 98,000 utterances, with at least three dialogue types in each dialogue.
To promote research on medical consultation dialogue systems, we conduct benchmarking experiments on MidMed for end-to-end dialogue generation. Furthermore, to generate informative and relevant responses with dialogue topic sequences, inspired by Schick and Schütze (2021) and Wei et al. (2021), we present an instruction-guiding medical dialogue generation framework (InsMed) to handle mixed-type dialogues. InsMed is composed of a dialogue topic selection module, a reference knowledge selection module, and an instruction-based response generation module. Specifically, the topic selection module and the reference knowledge selection module pick suitable dialogue topics and reference knowledge for generating responses, respectively. Then, the dialogue topics and reference knowledge are converted to natural-language instructions with well-designed templates. For example, an instruction is "In the next utterance, the doctor will recommend a diet. The recommended diet is fruits and vegetables". These instructions are concatenated with the dialogue context as the input to generation models.
This work makes the following contributions:
• We identify a new challenge: in many real-world scenarios, it is usually difficult for patients to have clear goals before medical consultations.
• To mitigate this challenge, we propose a novel task, medical consultation over mixed-type dialogue, and collect a new Chinese human-to-human mixed-type dialogue dataset, in which each session has rich variability of dialogue types with natural topic transitions.
• We build baselines on MidMed and propose an instruction-guiding response generation framework InsMed to address this task. Experimental results show the effectiveness of InsMed.
Related Work

Dialogue Systems for Diagnosis
There has been growing research interest in developing dialogue systems for automatic diagnosis. These dialogue systems aim to assist doctors in pre-collecting symptoms and patient information and then giving patients timely diagnoses. These works are divided into two categories: the pipeline manner and the end-to-end manner. Wei et al. and Liu et al. (2022a) break the systems into natural language understanding, dialogue management, and natural language generation, in a pipeline manner. These three modules are trained with their respective annotated data, and each feeds its output to the next module. Meanwhile, Zeng et al. (2020) build an end-to-end model on large-scale unannotated medical dialogue data. Compared with the pipeline manner, the end-to-end manner requires no annotated dataset but provides no supervision for the intermediate states.
In addition to methods, many datasets are publicly available. The medical dialogue datasets are listed in Table 1. Among them, MZ, DX (Xu et al., 2019), CMDD, MedDG (Liu et al., 2022a), and DialoACM are datasets for pipeline dialogue systems for automatic diagnosis. MedDialog (Zeng et al., 2020) is a large-scale unannotated dataset, utilized for end-to-end training.
These medical dialogue datasets focus on diagnosis, and ignore consultation. Compared with these datasets, MidMed is a medical dialogue dataset for consultation, covering mixed-type dialogues.

Mixed-type Dialogue Systems
Recently, research on mixed-type dialogue has increased significantly. This research falls into two categories: (1) training an all-in-one conversation model on multiple single-skill conversation datasets, such as persona-chat and task-oriented dialogue, to blend multiple dialogue skills (Madotto et al., 2020; Madotto et al., 2021); (2) collecting mixed-type dialogue datasets (Smith et al., 2020; Sun et al., 2021; Chiu et al., 2022; Liu et al., 2022b) to train mixed-type dialogue models. Those datasets are intended to mix different dialogue skills to meet specific needs, such as recommending movies and songs, and cannot address medical consultation. In contrast, we collect a mixed-type dialogue corpus, MidMed, to facilitate the study of medical consultation.

Dataset Collection
In this section, we describe the three steps for MidMed construction: (1) Selecting basic diagnosis dialogue data; (2) Constructing annotation guidance; (3) Collecting mixed-type dialogue by crowdsourcing.

Selecting Basic Diagnosis Dialogue
To stay close to real-world scenarios, MidMed is constructed based on the real-world diagnosis dialogue dataset MedDialog (Zeng et al., 2020), which is collected from the online medical community haodf.com.
MedDialog dataset contains 3.4 million Chinese dialogues (consultations) between patients and doctors, covering 29 broad categories of specialties including internal medicine, pediatrics, dentistry, etc., and 172 fine-grained specialties including cardiology, neurology, gastroenterology, urology, etc.
Basic Dialogue Selection. For MidMed construction, we recruit twenty medical students who are experts in four departments: otorhinolaryngology, ophthalmology, skin, and the digestive system. To ensure data quality and construction efficiency, only dialogues in these four departments are retained. Besides, we observe that dialogues with few utterances are usually of poor quality. Thus, only conversations with more than four utterances are kept. After the above processing, a total of 9,000 dialogues are obtained.
Coarse-grained Privacy Removing. Furthermore, for ethical reasons, specific regular expressions are employed for coarse-grained privacy removal. To remove patients' privacy, regular expressions such as "我叫... (My name is ...)" are designed to delete sentences containing names, gender, and regions. Besides, regular expressions such as "陈医生您好... (Hello, doctor Chen, ...)" are utilized to remove doctors' privacy.
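The selection and coarse-grained filtering steps above can be sketched as follows. The exact patterns are not published, so these regular expressions are illustrative approximations of the rules described in the text.

```python
import re

# Illustrative patterns (hypothetical approximations of the paper's rules):
# sentences beginning with "我叫" (My name is ...) leak patient names;
# greetings containing "医生您好" (Hello, doctor ...) leak doctor names.
PATIENT_NAME = re.compile(r"[^。！？]*我叫[^。！？]*[。！？]?")
DOCTOR_NAME = re.compile(r"[^。！？]*医生您好[^。！？]*[。！？]?")

def remove_privacy(utterance: str) -> str:
    """Drop whole sentences matched by the coarse-grained privacy patterns."""
    for pattern in (PATIENT_NAME, DOCTOR_NAME):
        utterance = pattern.sub("", utterance)
    return utterance

def keep_dialogue(utterances: list[str], min_utterances: int = 5) -> bool:
    """Keep only dialogues with more than four utterances."""
    return len(utterances) >= min_utterances
```

Filtering at the sentence level, rather than masking individual tokens, follows the paper's description of deleting entire privacy-bearing sentences.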

Constructing Annotation Guidance
Annotation guidance is designed to instruct annotators for data annotation, including target dialogue topic sequences and reference knowledge. Specifically, target topic sequences assign topics for each dialogue session. To support the annotation of each topic, reference knowledge is provided.

Target Dialogue Topic Sequence
Due to the complexity of the data annotation, it is difficult for annotators to work from only high-level instructions. Inspired by the work of MultiWOZ (Budzianowski et al., 2018), we provide a target dialogue topic sequence for each dialogue construction. The dialogue topic sequences instruct annotators to annotate the content of specific topics. As shown in Figure 1, the target dialogue topic sequence is composed of dialogue topics, including Patient Self Report, Doctor Inquiry Additional, Doctor Recommend Medicine, etc. The whole set of dialogue topic sequences is shown in Figure 2. The combination of different topics ensures the diversity of dialogue topic sequences.

Reference Knowledge
The knowledge graph stores large-scale knowledge in the form of easy-to-use triples and has various applications in all modules of human-computer dialogue systems (Tuan et al., 2019, 2022). Therefore, we incorporate knowledge graphs into medical consultation to provide more accurate interactive questions and answers. Specifically, we crawl a large number of web pages from high-quality medical vertical websites such as 39.net, and then obtain a large amount of triple knowledge by using information extraction techniques such as entity extraction and relation extraction. With these triples, a large-scale medical knowledge graph is constructed, whose entities include diseases, symptoms, drugs, foods, etc., and whose relations include the disease-drug relation, the disease-food relation, etc.
To provide reference knowledge for dialogue annotation, we extract a knowledge graph subset for each dialogue. Specifically, diseases in the whole knowledge graph are matched against the dialogue with exact string matching. Diseases appearing in a medical dialogue are employed as head entities to select triples from the knowledge graph. Finally, we extract a knowledge graph subset covering four types of entities (disease, symptom, diet, and medicine), with a total of 229,570 triples.
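The per-dialogue subset extraction can be sketched as below, assuming triples are stored as (head, relation, tail) tuples; the example triples are hypothetical, not taken from the released knowledge graph.

```python
# Hypothetical triples for illustration only.
KG = [
    ("hyperhidrosis", "disease-drug", "dexamethasone cream"),
    ("hyperhidrosis", "disease-food", "tomatoes"),
    ("gastritis", "disease-food", "porridge"),
]

def extract_subgraph(dialogue_text: str, kg):
    """Exact string matching: a disease is linked to the dialogue if its
    name literally appears in the dialogue text; all triples whose head
    is a matched disease form the per-dialogue reference knowledge."""
    matched = {head for head, _, _ in kg if head in dialogue_text}
    return [triple for triple in kg if triple[0] in matched]
```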

Collecting Mixed-type Dialogue
For data annotation, the trial annotation and the formal annotation are conducted sequentially. First, the trial annotation aims to select an annotation team and familiarize it with the annotation guidance. Second, the formal annotation is conducted to collect the whole dataset.

Trial Annotation
To ensure high dialogue quality, trial annotation is conducted. In the trial annotation stage, three crowdsourcing teams (about 20 annotators per team) participate. There are two main advantages: (1) trial annotation helps select a reliable annotation team; (2) it helps the annotation teams get familiar with the annotation task. Finally, the team achieving the best performance in the trial annotation is selected for the formal annotation.

Formal Annotation
After the trial annotation, the formal annotation is conducted. In the formal annotation, to ensure data quality, fine-grained privacy removal, a skipping option, and quality audit and re-annotation mechanisms are employed. To ensure diversity, annotation without target dialogue topic sequences is allowed.
Overall Annotation. In the formal data annotation process, annotators act as doctors and patients in turn. Annotators construct dialogues based on a given basic diagnosis dialogue, a target dialogue topic sequence, and reference knowledge. The annotation proceeds as follows. First, the annotator enters the chat interface to start chatting, and the "patient" initiates the conversation. Second, annotators conduct the dialogue following the dialogue topic sequence; importantly, the information used in the dialogue must conform to the reference knowledge. After all target topics have been mentioned in sequence, the "doctor" ends the conversation.
Furthermore, we introduce the fine-grained privacy removing, the skipping option, quality audit and re-annotating to improve data quality, and introduce the annotation without target dialogue topic sequence mechanism to improve data diversity.
Fine-grained Privacy Removing. In the data annotation process, for better data quality, annotators are also required to delete privacy that cannot be covered by regular expressions, including gender, age, name, institution name, etc.
Skipping Option. We observe that there are many basic diagnosis dialogues with low quality.
These bad dialogues may lead to annotated dialogues of low quality. To alleviate this issue, a skip option is provided to annotators. Specifically, annotators can choose whether or not to annotate the given basic diagnosis dialogue, according to its quality. If annotators choose "Skip", they skip the current dialogue directly and proceed to the annotation of the next dialogue.
To ensure the option is not overused, we review all the skipped conversations and select high-quality dialogues from them. These high-quality dialogues are returned to the annotation process, and the remaining low-quality dialogues are discarded.
Quality Audit and Re-annotating. To deal with low-quality samples, we introduce the quality audit and re-annotation mechanism. Specifically, we review all the annotated samples and pick out low-quality dialogues. These low-quality samples are returned to the annotation team for re-annotation.
Annotation without Target Dialogue Topic Sequence. Though the target dialogue topic sequences lead to good annotation quality, they usually lead to monotonous dialogue structures. To address the issue, annotators are also allowed to construct the dialogues without following the target dialogue topic sequences. This option enables annotators to construct more diverse and flexible dialogues based on the basic diagnosis dialogues. Meanwhile, to prevent this option from being abused, this option is required to be used for no more than ten percent of the whole annotation data.
Data quality. For data quality evaluation, we employ human evaluation. Specifically, we assign "1" to dialogues consistent with the annotation guidance and "0" to the others. We then conduct this evaluation on 100 randomly sampled dialogues and obtain an average score of 0.90. The result indicates that the dialogues in the dataset are of high quality.

Method
During training, a dialogue with a sequence of utterances between a patient and a doctor is given. The dialogue is processed into a set of samples (s_i, t_i) ∈ D, where t_i is a doctor's response, s_i is the concatenation of all former utterances before t_i, and D is the training dataset. Dialogue generation is formulated as a sequence-to-sequence generation problem, which aims to generate t_i conditioned on s_i.

Figure 3: Example instructions, e.g., the dialogue topic instruction "In the next utterance, the doctor will recommend medicine." and the reference knowledge instruction "The recommended medicine is dexamethasone cream."
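The construction of (s_i, t_i) pairs can be sketched as follows; the role labels and the whitespace joining of the context are assumptions.

```python
def build_samples(dialogue):
    """dialogue: list of (role, utterance) tuples in chronological order.
    Each doctor utterance t_i is paired with the concatenation s_i of
    all utterances that precede it."""
    samples = []
    for i, (role, utterance) in enumerate(dialogue):
        if role == "doctor" and i > 0:
            context = " ".join(u for _, u in dialogue[:i])
            samples.append((context, utterance))
    return samples
```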
InsMed has three modules: dialogue topic selection, reference knowledge selection, and instruction-guided generation. The dialogue topic selection and reference knowledge selection modules obtain dialogue topics and reference knowledge, respectively. Then, for better generation performance, these two types of information are transformed into natural-language instructions. Finally, the instructions are concatenated with the context as the input to the generation model. Next, these modules are introduced.

Dialogue Topic Selection
The dialogue topic selection module is divided into two stages: dialogue topic prediction and dialogue topic converting.
The dialogue topic prediction stage predicts the dialogue topic for the next utterance. Formally, this task is regarded as a multi-class classification problem. Specifically, the input of the prediction module is a dialogue context s_i, and the output is the predicted dialogue topic. The classification process is formulated as p_i = f(s_i), where f is the classification function BERT (Devlin et al., 2018), p_i ∈ R^{|C|} is the vector of predicted probabilities, and C is the predefined category set. The topic a_i whose dimension has the highest probability value in p_i is selected as the predicted dialogue topic.
Then, in the dialogue topic converting stage, a_i is converted into natural language with predefined templates, represented as ã_i. For example, if the predicted topic is Recommend Medicine, the converted instruction is "In the next utterance, the doctor will recommend medicine".
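A minimal sketch of the two stages, with the BERT classifier replaced by its output probabilities, and with the topic inventory and template wording reduced to the examples given in the text (both are assumptions):

```python
# Hypothetical topic set C and templates; the paper's full inventory is larger.
TOPICS = ["Doctor Recommend Medicine", "Doctor Recommend Diet", "Chitchat"]
TEMPLATES = {
    "Doctor Recommend Medicine": "In the next utterance, the doctor will recommend medicine.",
    "Doctor Recommend Diet": "In the next utterance, the doctor will recommend a diet.",
    "Chitchat": "In the next utterance, the doctor will chitchat with the patient.",
}

def select_topic(probs):
    """Stage 1: pick the topic a_i whose dimension has the highest
    probability value in p_i (here probs stands in for BERT's output)."""
    best = max(range(len(probs)), key=lambda j: probs[j])
    return TOPICS[best]

def to_instruction(topic):
    """Stage 2: convert the predicted topic a_i into the natural-language
    instruction ã_i via a predefined template."""
    return TEMPLATES[topic]
```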

Reference Knowledge Selection
The reference knowledge selection module aims to obtain the reference knowledge for model generation, thus guiding models to generate more informative responses. The module is divided into two parts: knowledge retrieval and reference knowledge converting. The knowledge retrieval part aims to retrieve reference knowledge from the whole knowledge graph for response generation. Exact string matching is utilized for retrieval. Specifically, the diseases {d_i}_{i=1}^{m} in the whole knowledge graph are matched against medical dialogues with exact string matching, where m is the number of diseases. A disease d_i appearing in a medical dialogue is regarded as a related disease of the dialogue. Then, the reference knowledge is obtained by querying the knowledge graph with d_i: e = {tail | (head, relation, tail) ∈ KG, head = d_i, relation = r}, where r is the slot in the predicted dialogue topic a_i. For example, if the dialogue topic is Doctor Recommend Medicine, r is Medicine, and e = {bonmopirocin ointment, dexamethasone cream}.
Then, in the reference knowledge converting part, e is converted into natural language with predefined templates, represented as ẽ_i. As in the example in Figure 3, the converted knowledge instruction is "the recommended medicine is bonmopirocin ointment and dexamethasone cream".
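The retrieval rule e = {tail | (head, relation, tail) ∈ KG, head = d_i, relation = r} and the template conversion can be sketched as below; the triples and the relation names are hypothetical.

```python
# Hypothetical triples; relation names mirror the topic slots in the text.
KG = [
    ("hyperhidrosis", "Medicine", "bonmopirocin ointment"),
    ("hyperhidrosis", "Medicine", "dexamethasone cream"),
    ("hyperhidrosis", "Diet", "tomatoes"),
]

def retrieve(disease, relation, kg):
    """e = {tail | (head, relation, tail) in KG, head = disease, relation = r}."""
    return [tail for head, rel, tail in kg if head == disease and rel == relation]

def knowledge_instruction(relation, tails):
    """Convert e into the natural-language instruction ẽ_i via a template."""
    return f"The recommended {relation.lower()} is {' and '.join(tails)}."
```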

Instruction-guiding Generation
The instruction-guiding generation module aims to generate accurate and informative responses with instructions.
The problem of response generation is formulated as a sequence-to-sequence task (Sutskever et al., 2014). The input to the generation model is the concatenation of the dialogue context s_i, the predicted dialogue topic instruction ã_i, and the reference knowledge instruction ẽ_i. The output is the doctor's response t_i.
BART (Lewis et al., 2020) is utilized as the generation model. The forward calculation process is formulated as t_i = f_g([s_i; ã_i; ẽ_i]), where f_g represents the generation model BART.
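A sketch of how the generator input [s_i; ã_i; ẽ_i] is assembled, assuming simple string concatenation with a separator token; the actual tokenization and the BART forward pass f_g are omitted.

```python
def build_generator_input(context, topic_instruction, knowledge_instruction,
                          sep=" [SEP] "):
    """Concatenate dialogue context s_i, topic instruction ã_i, and
    knowledge instruction ẽ_i into one input string for the generator.
    The separator token is an assumption, not specified in the paper."""
    return sep.join([context, topic_instruction, knowledge_instruction])
```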

Experiments and Results
This section introduces experimental setting, data and evaluation metrics, baselines, automatic evaluations, human evaluations, and the ablation study.

Experimental Setting
Implementation Details. For Transformer, the implementation by HuggingFace is utilized, where the hyperparameters follow the default settings of the original Transformer (Vaswani et al., 2017).
For DialoGPT-small (Zhang et al., 2018), the layer number, the embedding size, and the context size are set as 10, 768, and 300, respectively. In layer normalization, the epsilon hyperparameter is set as 1e-5. In multi-head self-attention, the number of heads is set as 12. The weight parameters are learned with Adam, with the initial learning rate 1.5e-4 and the batch size 32.
For BART, the large version is employed, with the learning rate 2e-5. In BART, the BERT encoder and GPT decoder are Transformers with 12 layers and a hidden state size of 768. The dropout rate is set as 0.1. The maximum length of input sequences is truncated to 512, and that of output sequences is truncated to 256.
Computing Platform. Our experiments are conducted on the workstation with an Intel Xeon E5 2.40 GHz CPU, 128 GB memory, an NVIDIA A100 GPU, and CentOS 7.2.

Data and Evaluation Metrics
We split MidMed into training, validation, and test sets by randomly sampling 70%, 10%, and 20% of the data, respectively.
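A minimal sketch of such a random split; the fixed seed is an assumption for reproducibility, as the paper does not specify one.

```python
import random

def split_dataset(dialogues, seed=42):
    """Randomly partition dialogues into 70% train, 10% validation, 20% test."""
    indices = list(range(len(dialogues)))
    random.Random(seed).shuffle(indices)
    n = len(dialogues)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    train = [dialogues[i] for i in indices[:n_train]]
    val = [dialogues[i] for i in indices[n_train:n_train + n_val]]
    test = [dialogues[i] for i in indices[n_train + n_val:]]
    return train, val, test
```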

Automatic Evaluation
Following Zeng et al. (2020), four basic automatic evaluation metrics for generation tasks are utilized in this work: ROUGE (Lin, 2004), NIST-4 (Doddington, 2002), BLEU-n (Papineni et al., 2002) (where n is the size of the n-gram), and METEOR (Agarwal and Lavie, 2007). These metrics all measure the similarity between the generated responses and the ground truth via n-gram matching.
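As a toy illustration of the n-gram matching that underlies these metrics, modified n-gram precision (the core quantity of BLEU-n, shown here without the brevity penalty) can be computed as:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram's count is clipped
    by its count in the reference, then divided by the candidate total.
    candidate and reference are token lists."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in Counter(cand).items())
    return clipped / len(cand)
```

The clipping is what prevents a degenerate candidate that repeats a common word from scoring highly.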

Human Evaluation
Following prior work, three human evaluation metrics are utilized in this work: relevance, informativeness, and human-likeness.
Relevance measures the fluency, relevancy, and logical consistency of each response given the current goal and global context:
• score 0 (bad): more than two-thirds of the responses are irrelevant or logically contradictory to the given current goal and global context.
• score 1 (fair): more than one-third of the responses are irrelevant or logically contradictory to the given current goal and global context.
• score 2 (good): fewer than one-third of the responses are irrelevant or logically contradictory to the given current goal and global context.
Informativeness examines how much knowledge (goal topics and topic attributes) is provided in responses:
• score 0 (bad): no knowledge is mentioned at all.
• score 1 (fair): only one knowledge triple is mentioned in the response.
• score 2 (good): more than one knowledge triple is mentioned in the response.
Human-likeness examines the similarity between each generated response and the corresponding human response from the perspectives of appropriateness, fluency, and proactivity:
• score 0 (bad): not like human responses.
• score 1 (fair): like human responses, but some parts still have deficiencies.
• score 2 (good): as appropriate, fluent, and proactive as human responses.

Baselines
We carefully select a few strong baselines for comparison. Specifically, two baselines for mixed-type dialogue generation (BST (Smith et al., 2020) and MGCG), a baseline for medical dialogue generation (VRbot (Li et al., 2021)), two common baselines for medical dialogue (Seq2Seq (Sutskever et al., 2014) and DialoGPT), and a baseline for general dialogue generation (BART (Lewis et al., 2020)) are used in this experiment. Besides, the proposed model utilizes the same data as these baselines, with domain-specific knowledge.
BST (Smith et al., 2020) is a mixed-type dialogue model that can display many skills, and blend them in a seamless and engaging way.
MGCG consists of a goal-planning module and a goal-guided responding module. The goal-planning module conducts dialogue management to control the dialogue flow. The responding module generates responses for completing each goal.
VRbot (Li et al., 2021) introduces both patient state and physician action as latent variables with categorical priors for explicit patient state tracking and physician policy learning, respectively. A variational Bayesian generative approach is utilized to approximate posterior distributions over patient states and physician actions.
Seq2Seq (Sutskever et al., 2014) uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of fixed dimensionality, and then another LSTM to decode the target sequence from the vector.
DialoGPT is a large, tunable neural conversational response generation model based on GPT. DialoGPT is trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017.
BART (Lewis et al., 2020) is a denoising autoencoder for pretraining sequence-to-sequence models. It is composed of a BERT encoder (a bidirectional encoder) and a GPT decoder (a left-to-right decoder).

Table 4: Human evaluation results of six baseline models, InsMed, and Groundtruth, on three aspects, including relevance, informativeness, and human-likeness. Scores of "0", "1", and "2" are assigned to each dialogue, where "0" represents bad samples and "2" represents good samples. The average scores are reported.

Automatic Evaluation
The results on automatic evaluation metrics are shown in Table 3. InsMed is compared with the other generation models on various evaluation metrics. The results lead to the following conclusions. First, BART (large) is much better than the other baseline generation models. The reason may be that BART (large) is much more powerful, with more parameters and more training data.
Second, InsMed achieves state-of-the-art performance on almost all metrics. This demonstrates that instructions help BART generate more accurate responses. Table 4 shows the human evaluation results on the test set of MidMed.

Human Evaluation
First, comparing BART and InsMed with the other baselines, the results demonstrate that pre-training on large-scale data improves relevance, informativeness, and human-likeness. The reason may be that pre-training on large-scale data provides a large amount of common language knowledge.
Second, comparing InsMed with BART, the results show that InsMed performs better than BART, especially in relevance and informativeness. The reason may be that instructions in InsMed provide specific targets for generation, leading to more relevant and informative response generation.

Ablation Study

Table 3 shows the ablation results, where "w/o Topic" means removing dialogue topic instructions from InsMed and "w/o KG" means removing reference knowledge instructions from InsMed. The results show that removing either component degrades performance, which illustrates the effectiveness of each part of InsMed.

Conclusion
This work identified the challenge of helping patients clarify their goals through medical consultations. To address this challenge, this work proposed a novel task, medical consultation over mixed-type dialogue, and collected a new Chinese human-to-human mixed-type dialogue dataset, in which each session has rich variability of dialogue types with natural topic transitions. To facilitate further research, we conducted benchmarking experiments on MidMed for end-to-end dialogue generation and proposed an instruction-guiding medical dialogue generation framework, InsMed. Experimental results show the effectiveness of InsMed. In the future, we will investigate the possibility of cross-department (e.g., dermatology and endocrinology) medical consultation at low cost.

Limitation
InsMed is built on the large-scale pre-trained model BART, which requires substantial computing resources.

Ethical Statement
We make sure that MidMed is collected in a manner consistent with the terms of use of its sources and with the intellectual property and privacy rights of the original authors of the texts. Crowd workers were treated fairly. This includes, but is not limited to, compensating them fairly, ensuring that they were able to give informed consent, and ensuring that they were voluntary participants who were aware of any risks of harm associated with their participation.