Doctor Recommendation in Online Health Forums via Expertise Learning

Huge volumes of patient queries are generated daily on online health forums, rendering manual doctor allocation a labor-intensive task. To better help patients, this paper studies a novel task of doctor recommendation to enable automatic pairing of a patient with a doctor who has relevant expertise. While most prior work in recommendation focuses on modeling target users from their past behavior, we can rely only on the limited words in a query to infer a patient’s needs, for privacy reasons. For doctor modeling, we study the joint effects of their profiles and previous dialogues with other patients and explore their interactions via self-learning. The learned doctor embeddings are further employed to estimate their capabilities of handling a patient query with a multi-head attention mechanism. For experiments, a large-scale dataset is collected from Chunyu Yisheng, a Chinese online health forum, where our model achieves state-of-the-art results, outperforming baselines that only consider profiles and past dialogues to characterize a doctor.


Introduction
The growing popularity of health communities on social media has revolutionized the traditional face-to-face paradigm of doctor consultation. Massive numbers of patients are now turning to online health forums to seek professional help; meanwhile, popular healthcare platforms are able to recruit a large group of licensed doctors to provide online service (Liu et al., 2020b). In the COVID-19 crisis, social distancing policies further boosted the use of these forums, where numerous patients query a wide variety of health problems every day (Gong et al., 2020). Nevertheless, in common practice (Cao et al., 2017), manual doctor allocation is adopted to handle each query, which largely limits the efficiency of helping patients in large numbers and results in an extremely expensive process. Under this circumstance, how can we automate and speed up the pairing of patients with doctors who are able to offer help?
* Equal contribution. Yubo Zhang was supported by PolyU Undergraduate Research and Innovation Scheme (URIS).
† Jing Li is the corresponding author.
1 Our dataset and code are publicly available at: https://github.com/polyusmart/Doctor-Recommendation
In this paper, we present a novel task of doctor recommendation, whose goal is to automatically figure out a patient's needs from their query on online health forums and recommend a doctor with relevant expertise to help. The solution cannot be trivially found from the mainstream recommendation approaches, because most recommender systems acquire the past behavior of target users (e.g., their purchase history) to capture their potential requirements, whereas our target users, the patients, should be anonymized to protect their privacy. Language features consequently play a role in our task because only a few query words are accessible for models to make sense of how a patient feels and who can best help them.
To illustrate our task, Figure 1 shows a patient's query q concerning insomnia and muscle aches, where it is hard to infer the cause of such symptoms from the short text, let alone to recommend a suitable doctor for problem-solving. It is hence crucial to explore the semantic relations between patient queries and doctor expertise for recommendation. To characterize a doctor's expertise, the modeling of their profile (describing what they are good at) provides a straightforward option. Nevertheless, the profiles are usually written in a professional language, while a patient tends to query in layman's terms. For instance, the doctor D who later solved q's problem is profiled with "neurological diseases", whose correlations with the symptom descriptions in q are rather implicit. Therefore, we propose to adopt previous dialogues held by a doctor with other patients (henceforth dialogues) to narrow the gap in language styles between doctor profiles and patient queries. Take the history dialogues of D in Figure 1 as an example: the words therein like "dizziness", "muscular atrophy", and "cyclopyrrolones" (treatments for insomnia) are all helpful to bridge D's expertise in neurological diseases with q's symptoms.
To capture how a doctor's profile is related to their dialogue history, we first construct a self-learning task to predict whether a profile and a dialogue are from the same doctor. It is designed to fine-tune a pre-trained BERT (Devlin et al., 2018) and align the profile writing and colloquial languages (used in patient queries and doctor responses) into the same semantic space to help model a doctor's expertise. Profiles and dialogues are then coupled with the query embeddings to explore how likely a doctor is qualified to help the patient. Here, multi-head attention aware of the doctor profile is put over the history dialogues to capture the essential content that indicates a doctor's suitability from multiple aspects, e.g., the capabilities of D in Figure 1 to handle both "insomnia" and "myopathy". Such a design reflects the intricate nature of health issues and would potentially allow the models to focus on the salient and relevant matters instead of being overwhelmed by the massive dialogues a doctor has engaged in, which may concern diverse points.
In comparison to other NLP studies concerning health forum dialogues (Xu et al., 2019; Zeng et al., 2020a), it is found that few of them attempt to spotlight doctors in these dialogues and examine how their expertise is reflected by what they say in these dialogues. Different from them, we explore doctor expertise from their profiles and history dialogues in order to fit a doctor's qualification to a patient's requests, which would advance the so far limited progress of doctor expertise modeling with NLP.
To the best of our knowledge, we are the first to study doctor recommendation to automate the pairing of doctors and patients in online health forums, where the joint effects of doctor profiles and their previous interrogation dialogues are explored to learn what a doctor is good at and how they are able to help handle a patient's request.
For experiments, we also gather a dataset with 119K patient-doctor dialogues involving 359 doctors from 14 departments from Chunyu Yisheng, a popular Chinese health forum. The empirical results show that doctor profiles and dialogue history work together to reflect well a doctor's expertise and how they are able to help a patient. In the main comparison, our model achieves state-of-the-art results (e.g., 0.616 by P@1), outperforming all baselines and the ablations that do not employ self-supervised learning or multi-head attention.
Moreover, we quantify the effects of doctor profiles, history dialogues, and patient queries in recommendation and our model shows consistently superior performance in varying scenarios. Furthermore, we probe into the model outputs to examine what our model learns with a discussion on multiple heads (in our attention map), a case study, and an error analysis, where the results reveal the potential of multi-head attention to capture various aspects of a doctor's expertise and point out the future direction to distinguish profile quality and leverage data augmentation and medical knowledge.

Data Collection and Analysis
Despite the previous contributions of large-scale data with doctor-patient dialogues (Zeng et al., 2020a), we note some essential information for doctor modeling is missing, e.g., the profiles. In this work, we present a new dataset to study the characterization of doctor expertise on health forums from both profiles and dialogue history.
Data Collection. We developed an HTML crawler to obtain the data from Chunyu Yisheng, one of the biggest online health forums in China. Then, seed dialogues involving 98 doctors were gathered from the "Featured QA" page. To ensure doctor coverage in varying departments, we also collected doctors from the "Find Doctors" page for each department, which results in the 359 doctors in our dataset. Finally, for each doctor, we crawled their "Favorable Dialogues" page and obtained the profile and history dialogues therein. All stop words were removed from each dialogue.
Data Analysis. The statistics of our dataset are reported in Table 1. We observe that dialogues are in general much longer than profiles. We also observe that a doctor engages in over 300 dialogues on average. This indicates that rich information is contained in dialogues to learn doctor expertise, while presenting challenges to capture the essential content therein for effective doctor embedding. We further plot the distribution of the number of dialogues a doctor engages in and the dialogue length distribution in Figure 2. It is observed that doctors contribute diverse amounts of dialogues, which reflects the wide range of doctor expertise and qualifications in practice. Nonetheless, a large proportion of doctors are involved in over 100 dialogues, while many dialogues are lengthy (with over 200 tokens). We can hence envision that a doctor's expertise may exhibit diverse aspects and that dense information is available in history dialogues, so an effective mechanism should be adopted to capture salient content.
We finally examine doctors' language styles by counting the number of medical terms based on the THUOCL medical lexicon. Results show that medical terms take 30.13% of tokens in doctor profiles, while the number is 7.83% and 5.52% for patient and doctor turns in dialogues, respectively. It is probably because doctors tend to profile themselves with professional language while adopting layman's language to discuss with patients.
Figure 2: In the left subfigure, the y-axis shows the number of doctors and the x-axis the number of dialogues a doctor is involved in. In the right subfigure, the y-axis indicates the number of dialogues in thousands (k) and the x-axis the dialogue length in tokens.

Doctor Recommendation Framework
We now introduce the proposed framework for our doctor recommendation task (overviewed in Figure 3). It contains three modules: a query encoder that encodes patient needs from queries, a doctor encoder that encodes doctor expertise from profiles and dialogues, and a prediction layer that couples the above outputs for recommendation prediction.
Figure 3: Overview of our framework. The doctor encoder first has its embedding layer (pre-trained BERT) fine-tuned via self-learning. It then employs profile-aware multi-head attention over dialogues to explore doctor expertise and works with the query encoder (to capture patient needs) to pair doctors with queries.
Model's Input and Output. The input of our model is from three sources: a query $q$ from a patient, the profile $p_i$ of doctor $D_i$, and a collection of $D_i$'s history dialogues $d_{i_1}, d_{i_2}, \ldots, d_{i_n}$ ($i_n$ denotes the number of dialogues $D_i$ previously engaged in). For each given query $q$, we first pair it with each doctor $D_i$ from a candidate pool of $m$ doctors and output a matching score $s_i$ to reflect how likely $D_i$ has the expertise to handle the request of $q$. A recommendation is then made for $q$ by ranking all the doctor candidates based on these matching scores $s_i$ ($i \in \{1, \ldots, m\}$).

Doctor Encoder
Here we introduce how we encode embeddings for a doctor D to reflect their expertise, which starts with the embedding of their profile and dialogues.
Profile and Dialogue Embedding. Built upon the success of pre-trained models for language representation learning, we employ a pre-trained BERT (Devlin et al., 2018) to encode the profile $p$ and obtain its rudimentary embedding $e_p$. Likewise, for a dialogue $d$, we convert it into a token sequence by linking turns in chronological order and encode its semantic features with BERT, which yields the dialogue embedding $e_d$.
Self-Learning. As analyzed in Section 2, doctor profiles are usually written in a professional language while dialogue language tends to be in layman's styles. To marry the semantics of profiles and dialogues into the same space, we design a self-learning task to predict whether a profile and a dialogue come from the same doctor, where random profile-doctor pairs are adopted as the negative samples. Then, the pre-trained BERT at the doctor encoder's embedding layer is fine-tuned by tackling the self-learning task, shaping an initial understanding of how profiles are related to dialogues.
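To make the self-learning setup concrete, the following is a minimal sketch of how training pairs for such a task might be assembled. The function name `build_self_learning_pairs`, the `num_negatives` parameter, and the choice to form a negative by pairing a dialogue with another doctor's profile are illustrative assumptions, not details confirmed by the description above.

```python
import random

def build_self_learning_pairs(profiles, dialogues, num_negatives=1, seed=0):
    """Build (profile, dialogue, label) triples for a same-doctor prediction task.

    profiles:  dict mapping doctor id -> profile text
    dialogues: dict mapping doctor id -> list of dialogue texts
    Label 1: profile and dialogue come from the same doctor.
    Label 0: randomly mismatched pair (assumed negative-sampling scheme).
    """
    rng = random.Random(seed)
    doctor_ids = sorted(profiles)
    pairs = []
    for doc_id in doctor_ids:
        others = [d for d in doctor_ids if d != doc_id]
        for dialogue in dialogues[doc_id]:
            pairs.append((profiles[doc_id], dialogue, 1))
            # Negative: the same dialogue paired with another doctor's profile.
            for other in rng.sample(others, min(num_negatives, len(others))):
                pairs.append((profiles[other], dialogue, 0))
    rng.shuffle(pairs)
    return pairs

# Toy data (hypothetical): two doctors, three dialogues in total.
profiles = {"D1": "neurological diseases", "D2": "skin allergies"}
dialogues = {"D1": ["I feel dizzy at night"], "D2": ["my arm itches", "red rash"]}
pairs = build_self_learning_pairs(profiles, dialogues)
```

Each triple would then be fed to a BERT-based binary classifier; that classifier is outside the scope of this sketch.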
Multi-head Attention. We have shown in Figure 2 that a doctor may engage in massive amounts of dialogues, where only part of them may be relevant to a query. To allow models to attend to the salient information from the dense content provided by history dialogues, we put a profile-aware attention mechanism over dialogues. Here, multi-head attention is selected because of its capability to capture multiple key points. It potentially reflects the complicated nature of doctor expertise, which in practice would exhibit multiple aspects.
Concretely, the profile embedding $e_p$ is used as the query to attend over $[e_{d_1}, e_{d_2}, \ldots, e_{d_n}]^T$ (the dialogue embedding array), which serves as both the key and the value argument:
$$Q = e_p, \quad K = V = [e_{d_1}, e_{d_2}, \ldots, e_{d_n}]^T \tag{1}$$
For the $j$-th head, these three arguments are then respectively transformed through neural perceptrons with learnable weight matrices $W^Q_j$, $W^K_j$, and $W^V_j$ (Q for query, K for key, and V for value). Their outputs jointly produce an intermediate doctor representation $h_j$, which characterizes a doctor's expertise from one perspective:
$$h_j = \mathrm{Att}(Q W^Q_j, K W^K_j, V W^V_j) \tag{2}$$
where the $\mathrm{Att}(\cdot)$ operation is defined as:
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{dim}}\right) V \tag{3}$$
Here $dim$ is the dimension of key and value. The scaling factor $\frac{1}{\sqrt{dim}}$ helps keep the softmax output away from regions with extremely small gradients.
Finally, to combine the learning results from multiple heads, their outputs are concatenated and transformed with a learnable matrix $W^O$ to obtain the final doctor embedding $e_D$:
$$e_D = \mathrm{Concat}(h_1, h_2, \ldots, h_l)\, W^O \tag{4}$$
Here $l$ denotes the number of heads. The doctor embedding $e_D$, carrying features that indicate the expertise of doctor $D$, will then be coupled with the query encoder results for recommendation, as described in the following section.
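The profile-aware multi-head attention described above can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions, not the released implementation: the helper names (`matvec`, `attention_head`, `doctor_embedding`), the random weights, and the tiny dimensions are all hypothetical, and a real system would use a tensor library and learned parameters.

```python
import math
import random

def matvec(vec, mat):
    """Row vector (length a) times matrix (a x b) -> row vector (length b)."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(e_p, dialogue_embs, w_q, w_k, w_v):
    """One head: the profile embedding queries the dialogue embeddings."""
    q = matvec(e_p, w_q)
    keys = [matvec(e_d, w_k) for e_d in dialogue_embs]
    values = [matvec(e_d, w_v) for e_d in dialogue_embs]
    dim = len(q)
    # Scaled dot-product scores between the profile query and each dialogue key.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
    weights = softmax(scores)
    head = [sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))]
    return head, weights

def doctor_embedding(e_p, dialogue_embs, heads, w_o):
    """Concatenate all head outputs and project them with w_o."""
    concat, all_weights = [], []
    for w_q, w_k, w_v in heads:
        h, weights = attention_head(e_p, dialogue_embs, w_q, w_k, w_v)
        concat.extend(h)
        all_weights.append(weights)
    return matvec(concat, w_o), all_weights

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (hypothetical): 4-dim embeddings, 3 heads of width 2, 5 dialogues.
rng = random.Random(0)
emb_dim, head_dim, num_heads = 4, 2, 3
e_p = [rng.uniform(-1, 1) for _ in range(emb_dim)]
dialogue_embs = random_matrix(5, emb_dim, rng)
heads = [tuple(random_matrix(emb_dim, head_dim, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(num_heads * head_dim, emb_dim, rng)
e_D, attn = doctor_embedding(e_p, dialogue_embs, heads, w_o)
```

Each entry of `attn` is one head's distribution over the history dialogues, which is the quantity an attention-map analysis would inspect.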

Query Encoder and Prediction Layer
Then we describe how we measure the qualification of a doctor (embedded in e D ) to handle a query q.
Query Embedding. For anonymity reasons, only the linguistic signals in a query are available to encode a patient's request. Therefore, we adopt a strategy similar to the embedding of profiles and dialogues and build the query encoder on a pre-trained BERT. The learned feature is denoted as a query embedding $e_q$ to represent patient needs.
Recommendation Prediction. Given a pair of doctor $D$ and query $q$, the embedding results of the doctor encoder $e_D$ and the query encoder $e_q$ are coupled in the prediction layer for recommendation. We adopt an MLP architecture to measure the matching score $s$ of the $D$-$q$ pair, which indicates the likelihood that doctor $D$ is able to provide a suitable answer to query $q$ and is calculated as follows:
$$s = \sigma(W_{MLP}\,[e_D; e_q] + b_{MLP}) \tag{5}$$
Here $\sigma$ denotes the sigmoid activation function, $[\cdot;\cdot]$ denotes concatenation, and $W_{MLP}$ (weights) and $b_{MLP}$ (bias) are trainable.

Training Processes
Our framework is based on the pre-trained BERT and then fine-tuned in the following two steps. The first is to fine-tune the embedding layer of the doctor encoder (as described in Section 3.1). For the second, we fine-tune the entire framework by optimizing the weighted binary cross-entropy loss introduced in Zeng et al. (2020b):
$$\mathcal{L} = -\sum_{(D,q) \in \tau} \left( \lambda \cdot \hat{s}_{D,q} \log(s_{D,q}) + (1 - \hat{s}_{D,q}) \log(1 - s_{D,q}) \right) \tag{6}$$
Here $\tau$ is the training set formed with doctor-query pairs and $\hat{s}_{D,q}$ denotes the binary ground-truth labels, with 1 indicating that $D$ later responded to $q$ and 0 the opposite. $\lambda > 1$ balances the weights of positive and negative samples in model training, where the model would weigh more on positive $D$-$q$ pairs ($D$ indeed handled $q$) because negative samples may be less reliable and affected by many unpredictable factors, e.g., a doctor is too busy at some time. Intuitively, this training objective encourages models to assign high matching scores $s_{D,q}$ to a doctor $D$ who actually helped $q$.
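As a rough illustration of the weighted objective, the sketch below computes a binary cross-entropy in which positive pairs are up-weighted by λ. The exact form (a sum over pairs, with λ multiplying only the positive term) and the clipping constant `eps` are assumptions made for this example.

```python
import math

def weighted_bce(scores, labels, lam=5.0, eps=1e-12):
    """Weighted binary cross-entropy over doctor-query pairs.

    scores: predicted matching scores in (0, 1)
    labels: 1 if the doctor actually handled the query, else 0
    lam:    weight on positive pairs (lam > 1 emphasizes positives)
    """
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= lam * y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total

# A confident mistake on a positive pair costs lam times more than
# the same mistake on a negative pair.
loss_pos = weighted_bce([0.5], [1], lam=5.0)
loss_neg = weighted_bce([0.5], [0], lam=5.0)
```

With this weighting, an equally uncertain prediction is penalized five times more when the pair is a true doctor-query match, reflecting the lower reliability of negative samples.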

Experimental Setup
We now describe the setup for our experiments.
Dataset Preprocessing and Split. To preprocess the data for non-neural models, we employed the open-source toolkit jieba for Chinese word segmentation. For neural models, texts were tokenized with the toolkit attached to MC-BERT, a pre-trained BERT for biomedical language understanding, so that they can be fed into BERT. In the experiments, we maintained a vocabulary without stop words for dialogues' non-query turns while keeping them in queries and profiles, considering the high information density of the latter and the colloquial styles of the former. In terms of dataset split, 80% of the dialogues were randomly selected from each doctor to form the training set. For the remaining 20% of dialogues, we took their first turns (patient queries) to measure recommendation and split the queries into two random halves, one for validation and the other for test. In the training stage, we adopted negative sampling with a sampling ratio of 10 to speed up the process, while for inference, the doctor ranking is conducted on the top 100 doctors handling the most queries.
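The split and sampling procedure can be sketched as below. The function names `split_dialogues` and `sample_negatives`, the representation of a dialogue as a list of turns, and the reading of "sampling ratio of 10" as ten negative doctors per positive pair are assumptions introduced for illustration.

```python
import random

def split_dialogues(dialogues_by_doctor, train_ratio=0.8, seed=0):
    """Per-doctor 80/20 split; held-out queries are halved into val/test."""
    rng = random.Random(seed)
    train, held_out = {}, []
    for doc_id, dialogues in dialogues_by_doctor.items():
        shuffled = dialogues[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train[doc_id] = shuffled[:cut]
        # Only the first turn (the patient query) is kept for evaluation.
        held_out.extend((d[0], doc_id) for d in shuffled[cut:])
    rng.shuffle(held_out)
    half = len(held_out) // 2
    return train, held_out[:half], held_out[half:]

def sample_negatives(true_doctor, all_doctors, ratio=10, rng=random):
    """Draw up to `ratio` doctors other than the one who handled the query."""
    candidates = [d for d in all_doctors if d != true_doctor]
    return rng.sample(candidates, min(ratio, len(candidates)))

# Toy data (hypothetical): two doctors with ten two-turn dialogues each.
data = {
    "D1": [[f"D1 query {i}", "doctor reply"] for i in range(10)],
    "D2": [[f"D2 query {i}", "doctor reply"] for i in range(10)],
}
train, val, test = split_dialogues(data)
negatives = sample_negatives("D1", ["D1", "D2", "D3"], ratio=10,
                             rng=random.Random(1))
```

Splitting within each doctor keeps every doctor represented in training, so held-out queries are always matched against doctors whose history dialogues the model has seen.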
Model Settings. As discussed above, the pre-trained MC-BERT was employed to encode the queries, profiles, and dialogues, whose parameters were first fine-tuned on the self-learning task, followed by a second fine-tuning step to tackle the doctor recommendation task with the other neural modules. The maximum input length of BERT is 512, and the dimension of all text embeddings output by MC-BERT is 768. The hyperparameters are tuned on validation results, and the settings are as follows. The head number of multi-head attention is set to 6 and the trade-off parameter λ = 5 (Eq. 6) to weigh more on positive samples. The MLP at the output side contains one hidden layer of size 256. For training, we employ the Adam optimizer with an initial learning rate of 0.008 and a batch size of 256. The entire training procedure lasts 50 epochs, with an early stopping strategy adopted, and the parameter set that results in the lowest validation loss is used for test.
Baselines and Comparisons. We first consider weak baselines that rank doctors (1) randomly (henceforth RANDOM), (2) by the frequency of queries they handled measured on the training dialogues (henceforth FREQUENCY), (3) by referring to the doctors who responded to K (in practice K is set to 20) nearest patient queries in the semantic space (henceforth KNN), (4) by the cosine similarity of profile and query embeddings yielded by the pre-trained MC-BERT (henceforth COS-SIM (P+Q)), and its counterpart matching dialogues and queries (henceforth COS-SIM (D+Q)). Then, a popular non-neural learning-to-rank baseline GBDT (Friedman, 2001) with TF-IDF features is adopted (henceforth GBDT).
For neural baselines, we compare with the MLP that simply matches query embeddings with profile embeddings (henceforth MLP (P+Q)), with dialogue embeddings (henceforth MLP (D+Q)), and with the average embeddings of profile and dialogue (henceforth MLP (P+D+Q)). We also consider Deep Structured Semantic Models (DSSM (Huang et al., 2013)), a popular latent semantic model for semantic matching. In this work, the original bag-of-words encoding module in DSSM is replaced with BERT. The query embeddings are matched with profile embeddings (henceforth DSSM (BERT WITH P)) or the average embeddings of dialogues (henceforth DSSM (BERT WITH D)).
To further examine the effects of our attention design for doctor modeling in recommendation, we attend a doctor's history dialogues, aware of their profile, with two popular alternatives, dot and concat attention (Luong et al., 2015) (the former is henceforth referred to as DOT-ATT and the latter CAT-ATT). They both went through fine-tuning with the self-learning task before the training of recommendation to gain an initial view of how profiles and dialogues are related to each other. For comparison, we also experiment on our ablation based on multi-head attention without this self-learning step (henceforth MUL-ATT (W/O SL)).
Lastly, we examine the other two ablations: one encodes profiles only with a multi-head self-attention (henceforth MUL-ATT (W/O D)), and its counterpart is fed with dialogues only (henceforth MUL-ATT (W/O P)). The full model is henceforth named MUL-ATT (FULL).
For all models, we initialize them with three random seeds and average the results in three runs for the experimental report below.
Evaluation Metrics. Following the common practice (Zeng et al., 2020b; Zhang et al., 2021), the doctor recommendation results are evaluated with the popular information retrieval metrics: precision@N (P@N), mean average precision (MAP), and ERR@N. In the experimental report, N is set to 1 for P@N and 5 for ERR@N, whereas similar trends hold for other possible numbers.
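For reference, the three metrics can be computed from a ranked doctor list as in the sketch below. The function names are hypothetical, and the ERR variant assumes binary relevance mapped to a stop probability of 0.5 for a relevant doctor (the standard expected-reciprocal-rank grade mapping with a maximum grade of 1); the evaluation scripts used in the experiments may differ.

```python
def precision_at_n(ranked, relevant, n=1):
    """Fraction of the top-n ranked doctors that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doctor appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """queries: list of (ranked doctors, set of relevant doctors)."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

def err_at_n(ranked, relevant, n=5):
    """Expected reciprocal rank, assuming binary relevance (stop prob. 0.5)."""
    err, p_continue = 0.0, 1.0
    for rank, d in enumerate(ranked[:n], start=1):
        r = 0.5 if d in relevant else 0.0
        err += p_continue * r / rank
        p_continue *= 1.0 - r
    return err

# Toy example (hypothetical): the true doctor D1 is ranked second.
ranked, relevant = ["D3", "D1", "D2"], {"D1"}
p1 = precision_at_n(ranked, relevant, n=1)
ap = average_precision(ranked, relevant)
err5 = err_at_n(ranked, relevant, n=5)
map_score = mean_average_precision([(ranked, relevant), (["D2", "D1"], {"D2"})])
```

When each query has a single ground-truth doctor, average precision reduces to the reciprocal rank of that doctor.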

Experimental Results
In this section, we first present the main comparison results in Section 5.1. Then, we quantify the model sensitivity to queries, profiles, and dialogues of varying lengths in Section 5.2. Finally, Section 5.3 analyzes the effects of head number on validation performance, followed by a case study to interpret our superiority and an error analysis to provide insights for future work.

Main Comparison Results
Table 2 reports the comparison results across different models. We draw the following observations. First, it may require deep semantics to match doctor expertise with patient needs, and it is infeasible to rely on heuristic rules (e.g., frequency or similarity) or shallow features (e.g., TF-IDF) to tackle the task well. Second, compared to profiles, dialogues may better indicate how likely a doctor can help a patient, probably because of the richer content therein and the closer language style to a query (as analyzed in Section 2). Third, although profiles and dialogues may potentially collaborate to better characterize a doctor (than either alone), effective methods should be employed to couple their effects, as their writing styles vary.
For models with multi-head attention, all of them yield better results than their counterparts with other attention mechanisms. This may imply that doctor expertise is multi-faceted and that multi-head attention works well to capture such a feature. We also notice that a multi-head self-attention over profiles performs much worse than other ablations. It is probably because profile content is very dense and may challenge multi-head attention in distinguishing various aspects therein.
Compared with MUL-ATT (W/O SL), the results of both MUL-ATT (W/O P) (modeling doctors with dialogues only) and our full model are almost twice as good. This again demonstrates the challenges presented by the diverse wording patterns of profiles and dialogues, and shows that the self-learning step to fine-tune the pre-trained BERT largely helps in aligning them into the same semantic space.

Quantitative Analyses
In Section 5.1, we have shown that our model achieves better performance than various baselines. In this section, we further quantify its performance over varying lengths of queries, dialogues, and profiles, and compare the full model's results with its two ablations MUL-ATT (W/O P) and MUL-ATT (W/O SL), the first and second runners-up in Table 2. Afterwards, we compare model performance across different medical departments to examine the scenarios where patients already know which department they should go to.
Sensitivity to Query Length. Figure 4 shows the P@1 over varying lengths of patient queries. All models perform better for longer queries, owing to more content available to infer patient needs. Besides, our full model consistently outperforms its two ablations while showing a relatively smaller performance gain for longer queries compared to MUL-ATT (W/O P). A possible reason is that long queries may simplify the matching with doctors, and dialogue content may be sufficient to handle recommendation, diminishing the profile effects.
Sensitivity to Dialogue Length. We then study the model sensitivity to the length of dialogues for doctor modeling and show the results in Figure 5. Dialogue length exhibits similar effects to query length, possibly because they contribute homogeneous features to understand the doctor-patient match. After all, other patients' queries are part of the dialogues and involved in learning doctor expertise.
Sensitivity to Profile Length. Furthermore, we quantify the profile length and display the models' P@1 in Figure 6. Here, profile length exhibits different effects compared to the query and dialogue length discussed above: models suffer a performance drop for very long profiles, because of the potential noise therein hindering the collaboration between profiles and dialogues. Nevertheless, the self-learning step enables profiling language to blend into the colloquial embedding space of dialogues or queries, which hence presents more robust results. We observe more complicated effects compared to those from queries (Figure 4) or dialogues (Figure 5).

Comparisons of Model Performance over Varying Departments.
In practice, patients might already know which department they should turn to before seeking help from doctors. To better study doctor recommendation in this scenario, we examine the model performance within different medical departments in our data. We select the 4 models with the highest P@1 scores in the main experiment (Table 2) and show their results in Figure 7. We observe that, of all 14 departments, our model has the best performance in 13 and achieves comparable results with the best model for the remaining department (Otolaryngology). We also find that all models exhibit varying performance when handling queries from different departments, which is related to the departments' characteristics. For example, all models obtain low scores for Internal Medicine because of its significant overlap with others and the challenges to understand the needs from queries therein. Another factor is the imbalance of training data scale across departments. For instance, the training samples for Oncology, Surgery, and Otolaryngology are much fewer than the average, resulting in worse model performance on them.

Further Discussions
Analysis of Head Number. In Table 2, multi-head attention shows its superiority in modeling doctors. We are hence interested in the effects of head numbers and vary them on the validation set, with the results shown in Table 3. It is seen that model performance first increases and then decreases, with 6 heads achieving the best performance. This indicates that the head number affects model performance, because it controls the granularity of aspects a model should capture to learn doctor expertise.

Case Study.
To interpret what is learned by multi-head attention, we take the example in Figure 1 and analyze the attention map produced by 6 heads, where 4 of them attend to dialogue d 3 and the other 2 respectively highlight d 1 and d 2 .
Recall that d 1 , d 2 , and d 3 each reflect a different aspect of doctor expertise. To further probe into the attended content, we rank the words by the sum of attention weights assigned to the dialogues they occur in and show the top 5 medical terms in Table 4. It is observed that the heads vary in their focus, while all are related to the queried symptoms of "insomnia" and "muscle ache" and further contribute to a correct recommendation of a neurological expert. This again demonstrates the intricacy of doctor expertise and the capability of multi-head attention to reflect it well. More cases are shown in Appendix A to offer more insight into how our model recommends doctors.
Error Analysis. We observe two major error types of our model, one resulting from doctor modeling and the other from the query.
Table 4: The top 5 medical terms attended by each head for the case in Figure 1. The medical terms are from the THUOCL lexicon used in Section 2.
For doctor modeling, we observe many errors come from the diverse quality of profiles. As we have shown in Figure 6, not all content from profiles is helpful. For example, some doctors tend to profile themselves generally from experience (e.g., how many years they worked) instead of the specific expertise (what they are good at). Future work should concern how to further distinguish profile quality to learn doctor expertise.
In the real world, some doctors are skilled comprehensively while others are more specialized. This causes the models to tend to recommend the "Jack of all trades" rather than a more relevant doctor, as the former usually engaged in more dialogues and it is safer to choose them. For example, for a query concerning "continuous eye blinking", the model recommends a doctor with 100 "eyes"-related dialogues instead of the one specialized in "Hordeolum" and "Conjunctivitis" yet involved in only 30 dialogues. To mitigate such bias, it would be interesting to employ data augmentation (Zhang et al., 2020b) to "enrich" the history for doctors handling relatively fewer queries.
In terms of queries, many patients are observed to describe their symptoms with minutiae rather than focusing on the key points. The model, lacking professional knowledge, may consequently be trapped by these unimportant details. For instance, a patient queried about a "pimple" on the "eyelid"; the model wrongly attends to "eyelid" and thus recommends an ophthalmologist rather than a dermatologist to solve the "pimple" problem. A future direction to tackle this issue is to exploit knowledge from medical domains (Liu et al., 2020a) to allow a better understanding of patient needs.

Related Work
Our work is in the research line of recommender systems, which are widely studied because of their practical value in industry. For example, previous work explores users' chatting history to recommend conversations (Zeng et al., 2018, 2020b) and hashtags (Li et al., 2016; Zhang et al., 2021), browsing history to recommend news (Wu et al., 2019; Qi et al., 2021), and purchase history to recommend products (Guo et al., 2020). In contrast to most recommendation studies, which focus on modeling target users' personal interests from their past behavior, our work largely relies on the wording of a short query to figure out what is needed by a target user (patient), because they are anonymous for privacy reasons.
Among the branches of recommendation research, our task is conceptually similar to expert recommendation for question answering (Nikzad-Khasmakhi et al., 2019). In this field, many previous studies encode expertise knowledge in diverse streams, such as software engineering (Bhat et al., 2018), social activities (Bok et al., 2021), etc. Nevertheless, few of them attempt to model expertise with NLP methods. In contrast, language representations play an important role in tackling our task: we substantially explore how semantic features help characterize doctor expertise, which has not been studied before.
Our work is also related to the previous language understanding research over doctor-patient dialogues on online health forums (Zeng et al., 2020a), where various compelling applications are explored, such as information extraction (Ramponi et al., 2020; Du et al., 2019; Zhang et al., 2020c), question answering (Pampari et al., 2018; Xu et al., 2019), and medical report generation (Enarvi et al., 2020). In comparison with them, we focus on doctor expertise and characterize it from both doctor profiles and past patient-doctor dialogues, filling a gap left by previous work.

Conclusion
This paper has studied doctor recommendation in online health forums. We have explored the effects of doctor profiles and history dialogues in the learning of doctor expertise through a self-learning task and a multi-head attention mechanism. Substantial experiments on a large-scale Chinese dataset demonstrate the effectiveness of our method.

Ethical Considerations
It should be mentioned that all data, including doctors' profiles, patients' queries, and doctor-patient dialogues, are collected from the openly accessible online health forum Chunyu Yisheng whose owners make such information visible to the public (while anonymizing patients). Our dataset is collected by a crawler within the constraints of the forum. Apart from the personal information de-identified by the forum officially, to prevent privacy leaks, we manually reviewed the collected data and deleted sensitive messages. Additionally, we replaced each doctor's name with a unique code randomly generated to distinguish them while protecting their privacy. We ensure there is no identifiable or offensive information in the released dataset.
The dataset, approach, and model proposed in this paper are for research purposes only and are intended to facilitate studies of using NLP methods for doctor expertise learning and recommendation to allow a better user experience on online health forums. We also anticipate they could advance other NLP research, such as question answering (QA) in the biomedical domain.

A More Case Study Results
To provide more insight into why our model exhibits superior performance, we discuss two more cases to understand how the multi-head attention mechanism makes use of information from both the doctors' profiles and their history dialogues, in addition to the example cases shown in Figure 1 and Table 4. Because dialogues are mostly lengthy (as shown in Table 1), we show only dialogue snippets in English translation for better display (the model is fed the entire dialogues in the experiments).
We present in Table 5 a case sampled from the Department of Gynecology. As can be seen, the profile of the doctor is short, while the attended dialogues provide detailed information about symptoms, treatments, and medicine. The top 5 keywords identified by the sum of attention weights for each head are shown in Table 5(b). While several heads seem to attend to one or two specific tokens (for example, heads 1, 4, and 5 attend to the token "menstruation"), we observe that each head has its own focus. For example, it is reasonable to infer that head 1 concerns messages related to the preparation for pregnancy, head 4 irregular periods, and head 5 the prognosis of abortion.
Table 6 shows another example sampled from the Department of Dermatology. In this case, the doctor's profile is more detailed but generic. The top 5 keywords for each attention head are shown in Table 6(b). Similar to the observation from Table 5, the token "pruritus" appears among the most attended keywords of 5 heads because it is one of the most common symptoms, whereas each head focuses on different aspects related to the query.
Table 5: (a) The sample patient query q from anonymous patient P at the top, followed by the profile of a sample doctor D and four dialogues D engaged in before. u_P refers to utterances of P, and u_D to utterances of D. (b) The top 5 medical terms attended to by each head given the input sample in Table 5(a). The medical terms are from the THUOCL lexicon in Section 2.
Query q from Anonymous Patient P
In the past week, she has been keeping saying that her back, her legs, and her whole body were all itchy. I observe she has a few dry eczema spots on her body, a little wrinkled and peeling.
Profile p of Doctor D
Good at treating common skin diseases, including diagnosis and treatment of acne, urticaria, viral warts, eczema, shingles, etc.
Attended Dialogue d1
u_P: I've had beriberi for over a year. At night, I feel itchy and the skin of my feet peels off.
u_D: I suggest you apply topical antifungal ointment to your feet and wash socks with boiled water every day. It takes 4-6 weeks to cure tinea pedis.
Attended Dialogue d2
u_P: It's red and itchy around my mouth and nose. What's the matter with me?
u_D: Are they blisters or pimples? You possibly got seborrheic dermatitis.
u_P: My husband has beriberi, is it possible I'm infected by him?
u_D: Not likely.
Attended Dialogue d3
u_P: I froze this spot, is it going to scab and peel off?
u_D: It's already dark red, so theoretically it should soon peel off.
u_P: It's nearly fourteen days after freeze, can I bath now?
u_D: You could shower but should not bath. Be careful not to irritate this spot.
Attended Dialogue d4
u_P: I have nail fungus, and I felt itchy after I applied ciclopirox amine cream the day before yesterday. Today I observe my toes swell.
u_D: There is a possible delayed allergic reaction to the drug. I suggest you rinse your toes with warm water and stop applying that cream.
Table 6: (a) The sample patient query q from anonymous patient P at the top, followed by the profile of a sample doctor D and four dialogues D engaged in before. u_P refers to utterances of P, and u_D to utterances of D. (b) The top 5 medical terms attended to by each head given the input sample in Table 6(a). The medical terms are from the THUOCL lexicon in Section 2.
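The keyword ranking behind Tables 5(b) and 6(b) (top 5 medical terms per head by summed attention weight) can be illustrated with a minimal sketch. This is not the released implementation: the function name `top_keywords_per_head`, the input layout (one weight per token position per head), the toy tokens and weights, and the small `lexicon` set standing in for THUOCL are all assumptions made for illustration.

```python
from collections import defaultdict


def top_keywords_per_head(tokens, attention, lexicon, k=5):
    """Rank lexicon terms for each head by their summed attention weight.

    tokens:    token strings of the doctor's profile and dialogues
    attention: attention[h][i] is the weight head h assigns to tokens[i]
               (assumed layout, one weight per token position)
    lexicon:   set of medical terms to keep (a stand-in for THUOCL)
    """
    ranked = []
    for head_weights in attention:
        if len(head_weights) != len(tokens):
            raise ValueError("each head needs one weight per token")
        totals = defaultdict(float)
        # Sum the weights over every occurrence of a lexicon term.
        for token, weight in zip(tokens, head_weights):
            if token in lexicon:
                totals[token] += weight
        # Highest total first; ties are broken alphabetically.
        order = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
        ranked.append([term for term, _ in order[:k]])
    return ranked


# Toy input (hypothetical values, not taken from the dataset).
tokens = ["pruritus", "eczema", "the", "pruritus", "ointment", "acne"]
attention = [
    [0.30, 0.10, 0.22, 0.25, 0.08, 0.05],  # head 0
    [0.05, 0.40, 0.10, 0.05, 0.30, 0.10],  # head 1
]
lexicon = {"pruritus", "eczema", "ointment", "acne"}
result = top_keywords_per_head(tokens, attention, lexicon, k=2)
print(result)
```

In this toy input, "pruritus" occurs twice, so its two weights are added before ranking; non-lexicon tokens such as "the" are ignored even when they receive a large weight.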