Dialogue Medical Information Extraction with Medical-Item Graph and Dialogue-Status Enriched Representation

Introduction
The digitization of medical systems over the past decade has accumulated massive amounts of medical data. Among the various kinds of medical data, multi-turn doctor-patient dialogues contain rich medical knowledge, such as the medical test suggested in response to a patient's symptom descriptions, or the medication recommended according to the doctor's diagnosis. If such medical knowledge can be properly extracted and represented, it can enable a wide range of clinical applications, such as disease diagnosis assistance and medication recommendation. To this end, we target a critical task: Dialogue Medical Information Extraction (DMIE). Given a multi-turn doctor-patient dialogue, the DMIE task aims to label the dialogue with predefined medical items and their corresponding statuses. We use an example from the MIEACL dataset (Zhang et al., 2020) to illustrate the DMIE task. As shown in Figure 1, this dialogue receives five labels in total. Each label consists of three sub-components: Category, Item and Status, and their combination provides a detailed medical description. As shown in Table 1, all items are grouped into four categories: Symptom, Surgery, Test and Other info, where Other info covers mental state, sleep situation, drinking habits, etc. In addition to items and their categories, the MIEACL dataset also defines five statuses to provide a fine-grained state for each item. For example, for an item whose category is Symptom, the status Patient-Positive indicates that the patient mentions he/she has this symptom, while the status Doctor-Negative indicates that the doctor denies this symptom in the dialogue. For an item with category Test, the status Doctor-Positive indicates that the doctor recommends the patient take this test. For example, in Figure 1, Label 4 (Test-Electrocardiogram: Patient-Neg) indicates that the patient mentioned he/she had not had an electrocardiogram test before. It is worth mentioning that the statuses of an item in the DMIE task are not mutually exclusive: as Label 4 and Label 5 show, an item can have two statuses at the same time.
Most existing approaches either focus on recognizing a subset of items or directly formulate DMIE as a classification task. Among existing DMIE works, Lin et al. (2019) and Du et al. (2019) only exploited the symptom items: they first recognized symptom terms from dialogues and then classified their corresponding statuses. Zhang et al. (2020) proposed a deep neural matching model which is essentially a multi-label text classification (MLTC) model, where each combination of item and status is treated as an independent class label. Recently, Li et al. (2021b) modeled the extraction as a generation process and developed a multi-granularity transformer which can effectively capture the interaction between role-enhanced cross-turns and integrate representations of mixed granularity. Xia et al. (2022) proposed a speaker-aware co-attention framework to address the intricate interactions among different utterances and the correlations between utterances and candidate items. All aforementioned methods disregard the relations among items as well as the relations between items and statuses.
In this paper, we propose a novel approach to model 1) the relations among items and 2) the relations between items and statuses in the DMIE task. Firstly, we adopt a pre-trained language model to learn representations for dialogues, items and statuses, respectively. Secondly, we introduce an item interactive heterogeneous graph convolution module, together with dialogue-aware and status-aware attention modules, to model these relations. Finally, we use the joint representation, which combines the item, status and dialogue representations as well as their mutual relations, to predict the set of proper labels for the dialogue. We evaluate our approach on the MIEACL benchmark, and the results show that the proposed model outperforms all baselines. A further ablation study shows the effectiveness of our framework in modeling the diverse relationships among items, dialogues and statuses. Overall, the main contributions are as follows: • We propose a novel model for DMIE, which introduces the interactions among medical items, incorporates doctor-patient dialogue information into item representations and explores the item-status relationships.
• Rather than directly formulating DMIE as a multi-label text classification task, we design a status-aware item refining module for the DMIE task, which effectively captures the relationship between items and statuses.
Existing DMIE approaches treat each medical item independently. As a result, the correlations among medical items are ignored in modeling. Different from the above works, we employ a heterogeneous graph to directly model the correlations between medical items. Furthermore, we introduce an item updating with dialogue-aware attention module to infer implicit mentions of medical items from dialogues.

Multi-Label Text Classification
Research on Multi-Label Text Classification (MLTC) focuses on two topics: document representation learning and label correlation learning. For a better understanding of the relations between text and different labels, researchers began to exploit knowledge of label correlations. Seq2emo (Huang et al., 2021) used a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) decoder to implicitly model the correlations among different emotions; CorNet (Xun et al., 2020) designed a computational unit, essentially an improved fully connected layer, to learn label correlations, enhance raw label predictions with correlation knowledge and output augmented label predictions; LDGN (Ma et al., 2021) employed dual Graph Convolution Networks (GCN) (Kipf and Welling, 2017) to model complete and adaptive interactions among labels based on label co-occurrence statistics, dynamically reconstructing the graph in a joint way; Zhang et al. (2021) designed a multi-task model in which two additional label co-occurrence prediction tasks were proposed to enhance label relevance feedback.
In addition, there are other methods to improve model performance. For example, LightXML (Jiang et al., 2021) introduced a probabilistic-label-tree-based clustering layer and a fully connected dynamic negative sampling layer on top of text representations to improve accuracy while controlling computational complexity and model size. For other NLP tasks such as slot filling, slot correlations are learned with a self-attention mechanism (Ye et al., 2021). However, slot filling aims to extract slots that fill the parameters of a user's query, which differs from the DMIE task targeted in this paper (classifying a dialogue into predefined item-status combinations).
Although DMIE could be treated as a special multi-label text classification task, the aforementioned methods are not able to process status information, which is not included in ordinary MLTC datasets. Thus, we design an item refining with status-aware attention module that learns the cross attention between statuses and items to generate status-enhanced item representations.

Problem Definition
Given a dialogue with multiple consecutive turns of conversation, the objective of the DMIE task is to detect all predefined items, together with their corresponding categories and mentioning statuses. Following Zhang et al. (2020), we formulate the DMIE task as follows.
Given a dialogue $D$, we need to predict the labels $\{y_{i,j}\} \in \{0,1\}^{|I| \times |S|}$, where $I$ and $S$ represent the item set and the status set, respectively, and $y_{i,j}$ denotes whether item $i \in I$ with status $j \in S$ is mentioned in this dialogue. Here $|I|$ is the total number of predefined items, and $|S| = 5$ is the total number of statuses, i.e., Patient-Positive, Patient-Negative, Doctor-Positive, Doctor-Negative and Unknown. These items can be grouped into a set of categories $C$ via a mapping $\tau(\cdot)$, where $\tau(i) \in C$ assigns the corresponding category, i.e., Symptom, Surgery, Test or Other info, to item $i$.
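Concretely, the label space can be sketched as a binary item-by-status matrix. The snippet below is a minimal illustration; the two item names are examples taken from Figure 1, not the full MIEACL inventory:

```python
# Illustrative (not exhaustive) item inventory; the real MIEACL set has |I| items.
items = ["Frequent Ventricular Premature Beats", "Electrocardiogram"]
statuses = ["Patient-Positive", "Patient-Negative",
            "Doctor-Positive", "Doctor-Negative", "Unknown"]

def empty_label_matrix(items, statuses):
    """y[i][j] = 1 iff item i is mentioned with status j. The task is
    multi-label: a single item may carry several statuses at once."""
    return [[0] * len(statuses) for _ in items]

y = empty_label_matrix(items, statuses)
# Label 4 from Figure 1: Test-Electrocardiogram with status Patient-Neg.
y[items.index("Electrocardiogram")][statuses.index("Patient-Negative")] = 1
```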

Dialogue, Item and Status Encoder
We denote a dialogue $D = \{u_1, u_2, \ldots, u_n\}$ with $n$ utterances, where $u_i = \{w^i_1, w^i_2, \ldots, w^i_{m_i}\}$ and $w^i_j$ is the $j$-th token of the $i$-th utterance in the dialogue. We treat $D$ as a long sequence of $|D| = \sum_i m_i$ tokens and feed it to the pre-trained language model to generate the contextualized dialogue representation $H^D$. For the predefined items and statuses, we feed their text descriptions into the pre-trained language model and use the mean pooling of the last-layer outputs to initialize their latent features $H^I$ and $H^S$, respectively.
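The mean pooling over the encoder's last-layer outputs can be sketched as follows. This is a NumPy stand-in with random tensors of assumed shapes; in the paper the hidden states come from a pre-trained language model:

```python
import numpy as np

def mean_pool(last_hidden, attention_mask):
    """Mask-aware mean pooling over token representations, used here (as
    described in the text) to initialize item/status latent features."""
    mask = attention_mask[..., None].astype(float)   # (B, T, 1)
    summed = (last_hidden * mask).sum(axis=1)        # (B, d)
    counts = np.maximum(mask.sum(axis=1), 1e-9)      # avoid divide-by-zero
    return summed / counts

hidden = np.random.randn(2, 7, 768)                  # (batch, tokens, d) - illustrative
mask = np.ones((2, 7))                               # no padding in this toy example
pooled = mean_pool(hidden, mask)                     # (2, 768)
```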

Item Interactive Heterogeneous Graph Convolution
Items in the DMIE task belong to four categories: Symptom, Surgery, Test and Other info, and prior information between item pairs can be defined by the relationships between their categories. In Figure 1, when the patient said that he/she had the symptom of frequent ventricular premature beats, the doctor advised him to take an electrocardiogram test and undergo radiofrequency ablation surgery.
As a result, we can observe a "Symptom-Test" relationship between frequent ventricular premature beats and electrocardiogram, and a "Symptom-Surgery" relationship between frequent ventricular premature beats and radiofrequency ablation. Such co-occurrence relationships between items are one form of prior knowledge that can be used in the DMIE task.
As shown in Figure 2, we design an Item Interactive Heterogeneous Graph Convolution (IIHGC) module to explicitly model the correlations among items.

Graph Construction
We use an undirected heterogeneous graph to model item correlations with different types of nodes and edges. We establish edges between items based on their categories and the frequency of their co-occurrence in the corpus. Formally, our graph is denoted as $G = (V, E)$. Each node $v \in V$ represents an item with corresponding category $\tau(v) \in C$. Each edge $e = (v_i, v_j) \in E$ represents a co-occurrence between items $v_i$ and $v_j$ from different categories. For example, the co-occurrence of frequent ventricular premature beats and electrocardiogram in Figure 1 gives an edge ⟨"Frequent Ventricular Premature Beats", "Symptom-Test", "Electrocardiogram"⟩ in our heterogeneous graph. In addition, the co-occurrence patterns between item pairs obtained from the training data may contain noise. To reduce this noise, we remove all edges whose items co-occur fewer than four times.
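The edge-construction rule above (cross-category co-occurrence counting with a minimum-frequency cutoff of four) might be sketched as follows. Here `dialogue_item_sets` and `category_of` are hypothetical inputs standing in for the annotated training corpus and the τ(·) mapping:

```python
from collections import Counter
from itertools import combinations

MIN_COOCCUR = 4  # edges seen fewer than four times are dropped as noise

def build_item_graph(dialogue_item_sets, category_of):
    """Count item co-occurrences across training dialogues and keep edges
    between items of different categories that co-occur often enough.
    Note: the relation string follows the alphabetical order of item names,
    an arbitrary choice in this sketch."""
    counts = Counter()
    for items in dialogue_item_sets:
        for a, b in combinations(sorted(set(items)), 2):
            if category_of[a] != category_of[b]:
                counts[(a, b)] += 1
    edges = []
    for (a, b), c in counts.items():
        if c >= MIN_COOCCUR:
            edges.append((a, f"{category_of[a]}-{category_of[b]}", b))
    return edges

# Toy corpus: one pair co-occurs 4 times (kept), the other 2 times (dropped).
edges = build_item_graph(
    [["Frequent Ventricular Premature Beats", "Electrocardiogram"]] * 4
    + [["Frequent Ventricular Premature Beats", "Radiofrequency Ablation"]] * 2,
    {"Frequent Ventricular Premature Beats": "Symptom",
     "Electrocardiogram": "Test",
     "Radiofrequency Ablation": "Surgery"},
)
```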

Heterogeneous Graph Convolution
After graph construction, we utilize a heterogeneous graph convolution module to model the item interactions, which is the IIHGC module in Figure 2. The IIHGC module takes the item representations $H^I = [h_1, \ldots, h_{|I|}]$ and the constructed graph as inputs and outputs high-level features $H^{HGC}$. We first decompose the entire heterogeneous graph into several homogeneous subgraphs according to the types of edge relationships, and then apply an independent Graph Attention Network (GAT) (Velickovic et al., 2018) to each subgraph to extract higher-level aggregated features. The GAT of each subgraph refines the node representations by aggregating and updating information from neighbors via a multi-head attention mechanism. For a given edge from node $i$ to node $j$, the attention coefficient of the $k$-th head, $\alpha^k_{ij}$, is calculated as:
$$\alpha^k_{ij} = \frac{\exp\!\big(\mathrm{LeakyReLU}\big(a_k^{\top}\,[W_k h_i \,\|\, W_k h_j]\big)\big)}{\sum_{l \in \mathcal{N}_i} \exp\!\big(\mathrm{LeakyReLU}\big(a_k^{\top}\,[W_k h_i \,\|\, W_k h_l]\big)\big)},$$
where $W_k \in \mathbb{R}^{d' \times d}$ and $a_k \in \mathbb{R}^{2d'}$ are learnable parameters, $d'$ is the dimension of each attention head, $\mathcal{N}_i$ is the set of neighbors of node $i$, and $\|$ denotes the concatenation operation.
Next, the representation of the $k$-th head for node $i$ is obtained as a linear combination of the attention coefficients and the corresponding features:
$$h^k_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^k_{ij}\, W_k h_j\Big),$$
where $\sigma$ is the activation function.
After that, we obtain the final information-aggregated output of node $i$ in this subgraph by concatenating the outputs of the $K$ independent attention heads:
$$H_i = \big\Vert_{k=1}^{K}\, h^k_i,$$
where $\|$ represents concatenation and $H_i$ has a dimension of $K \times d' = d$ features. Finally, we aggregate messages for each node from all subgraphs with different relationships:
$$H^{HGC}_i = \frac{1}{|R|} \sum_{r=1}^{|R|} H^r_i,$$
where $H^r_i$ is the representation of item $i$ calculated from the $r$-th subgraph and $|R|$ is the total number of relationship categories included in the heterogeneous graph.
In our setting, the final output of node $i$, $H^{HGC}_i$, has a dimension of $d$ features.
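A single GAT attention head under the formulation above can be sketched in NumPy. The shapes and the neighbor-dictionary interface are illustrative assumptions:

```python
import numpy as np

def gat_attention_coeffs(h, W, a, neighbors_of, slope=0.2):
    """Attention coefficients alpha_ij of one GAT head: a softmax over each
    node's neighborhood of LeakyReLU(a^T [W h_i || W h_j])."""
    z = h @ W.T                                    # projected features (N, d')
    alpha = np.zeros((h.shape[0], h.shape[0]))
    for i, nbrs in neighbors_of.items():
        s = np.array([a @ np.concatenate([z[i], z[j]]) for j in nbrs])
        s = np.where(s > 0, s, slope * s)          # LeakyReLU
        e = np.exp(s - s.max())                    # stable softmax
        alpha[i, nbrs] = e / e.sum()               # normalize over neighbors
    return alpha

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 6))                    # 4 item nodes, d = 6
W = rng.standard_normal((3, 6))                    # head dimension d' = 3
a = rng.standard_normal(6)                         # attention vector, size 2d'
alpha = gat_attention_coeffs(h, W, a, {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]})
```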

Item Updating with Dialogue-Aware Attention
In the DMIE task, each item corresponds to a normalized name. The content of the doctor-patient dialogue can help determine which item is mentioned and its corresponding status. In this paper, we utilize an item-dialogue cross attention mechanism to enrich the item representations for better classification, denoted as the Item Updating with Dialogue-Aware Attention (ID) module in Figure 2. We obtain the dialogue representation $H^D$ and the item representation $H^I$ from the dialogue encoder and the item encoder, respectively. Then we update the item representation through the aforementioned Item Interactive Heterogeneous Graph module and obtain the representation $H^{HGC}$. Next, in the ID module, we use an attention mechanism similar to the Transformer block (Vaswani et al., 2017) to model the relation between the dialogue and the items:
$$\tilde{H} = \mathrm{Add\&Norm}\big(\mathrm{MultiHead}(H^{HGC}, H^{HGC}, H^{HGC})\big),$$
$$H^{ID} = \mathrm{Add\&Norm}\Big(\mathrm{FFN}\big(\mathrm{Add\&Norm}(\mathrm{MultiHead}(\tilde{H}, H^{D}, H^{D}))\big)\Big),$$
where Add&Norm, MultiHead and FFN are the standard Transformer operations (Vaswani et al., 2017). Since there is no autoregressive or causal relationship between items, we do not use attention masks in the self-attention layer.
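A deliberately simplified, single-head NumPy sketch of this update pattern follows. Plain residual connections stand in for the full Add&Norm + FFN Transformer block, and all shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dialogue_aware_update(h_items, h_dialogue):
    """Single-head sketch of the ID module's attention pattern: unmasked
    self-attention over items, then cross-attention with items as queries
    and dialogue tokens as keys/values."""
    d = h_items.shape[-1]
    attend = lambda q, kv: softmax(q @ kv.T / np.sqrt(d)) @ kv
    x = h_items + attend(h_items, h_items)     # item self-attention (no mask)
    return x + attend(x, h_dialogue)           # item-dialogue cross-attention

rng = np.random.default_rng(0)
h_id = dialogue_aware_update(rng.standard_normal((3, 8)),    # |I| = 3 items
                             rng.standard_normal((10, 8)))   # 10 dialogue tokens
```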

Item Refining with Status-Aware Attention
One major difference between the DMIE task and the conventional multi-label text classification task is that DMIE needs to assign a status to each item. Therefore, we design an Item Refining with Status-Aware Attention (IS) module, shown in Figure 2, to enrich the representation of items. The IS module uses a mechanism similar to the ID module to refine the item representations:
$$H^{IS} = \mathrm{Add\&Norm}\Big(\mathrm{FFN}\big(\mathrm{Add\&Norm}(\mathrm{MultiHead}(H^{HGC}, H^{S}, H^{S}))\big)\Big),$$
where $H^{S}$ denotes the status representations from the status encoder and $H^{IS}$ is the refined status-aware item representation. Following the ID module, we also discard the attention mask in the self-attention layer of the IS module.

Label Prediction and Optimization Objectives
After the above procedures, we obtain the pooled dialogue representation $h_d = \mathrm{MeanPooling}(H^D)$, the item embeddings $H^I$, the item-graph interacted representation $H^{HGC}$, the dialogue-aware item representation $H^{ID}$ and the status-aware item representation $H^{IS}$. We predict the $|S|$-dimensional status vector of item $i$ by:
$$\hat{y}_i = \sigma\big(W_c\, z_i\big),$$
where $z_i \in \mathbb{R}^{9d}$ is the concatenated joint representation of item $i$ built from the representations above, $\sigma$ is the sigmoid activation function and $W_c \in \mathbb{R}^{|S| \times 9d}$ are the model parameters used for classification.
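The prediction head can be sketched as independent sigmoids over each item's joint feature, trained with binary cross entropy. The exact layout of the 9d concatenation is not reproduced here, and the dimensions are shrunk for illustration (the paper uses d = 768):

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, num_statuses, d = 4, 5, 8               # toy sizes, not the paper's
z = rng.standard_normal((num_items, 9 * d))        # joint 9d item features
W_c = rng.standard_normal((num_statuses, 9 * d)) * 0.01   # W_c in R^{|S| x 9d}
probs = 1.0 / (1.0 + np.exp(-(z @ W_c.T)))         # independent sigmoids, (|I|, |S|)

y = np.zeros((num_items, num_statuses))
y[1, 2] = 1.0                                      # one gold (item, status) cell
eps = 1e-12                                        # numerical safety for log
bce = -np.mean(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))
```

Because each (item, status) cell gets its own sigmoid, an item can legitimately receive several statuses at once, matching the non-exclusive statuses of DMIE.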
The proposed model is trained with the binary cross entropy loss:
$$\mathcal{L} = -\sum_{i=1}^{|I|} \sum_{j=1}^{|S|} \Big[ y_{i,j} \log \hat{y}_{i,j} + (1 - y_{i,j}) \log\big(1 - \hat{y}_{i,j}\big) \Big].$$

Experiments

Dataset and Preprocess
We validate our model on the MIEACL benchmark, which was collected from an online medical consultation website and released by Zhang et al. (2020). Zhang et al. (2020) divided each dialogue into multiple parts using a sliding window with step 1 and window size 5. In this manner, a dialogue with k conversation turns is split into k windows of size 5. In total, the 1,120 dialogues in MIEACL are divided into 18,212 windows.
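The sliding-window split might look like the following sketch, with one window ending at every turn so that a k-turn dialogue yields k windows. Treating the first, shorter windows this way is our assumption, chosen to be consistent with the reported counts:

```python
def split_into_windows(turns, window_size=5, step=1):
    """Sliding-window split in the style of Zhang et al. (2020): each window
    ends at one turn and reaches back at most `window_size` turns."""
    return [turns[max(0, i + 1 - window_size):i + 1]
            for i in range(0, len(turns), step)]

windows = split_into_windows([f"turn {i}" for i in range(8)])
# 8 turns -> 8 windows; the last window covers turns 3..7.
```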

Evaluation Metrics and Settings
Following the settings of Zhang et al. (2020), we use the MIE-Macro Precision, MIE-Macro Recall and MIE-Macro F1 score as our main evaluation metrics, which average the performance over the sliding windows. It is worth mentioning that, different from ordinary Macro-Average metrics, the MIE-Macro Precision and MIE-Macro Recall are set to 1 by Zhang et al. (2020) if the prediction is empty on a window with no labels. To reduce the impact of samples with no labels, we additionally use Micro-Average metrics for further comparison. For a fair comparison, we apply the same train/dev/test split as Zhang et al. (2020), and select the best performing model on the validation set for testing.
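The MIE-Macro convention for empty windows can be sketched as follows. The label-set-per-window interface is assumed, as is computing F1 from the averaged P and R:

```python
def mie_macro_prf(window_preds, window_golds):
    """Per-window precision/recall averaged over windows. Following the
    convention described for Zhang et al. (2020), a window with no gold
    labels and no predicted labels scores P = R = 1."""
    ps, rs = [], []
    for pred, gold in zip(window_preds, window_golds):
        pred, gold = set(pred), set(gold)
        if not pred and not gold:                  # empty window convention
            ps.append(1.0); rs.append(1.0)
            continue
        tp = len(pred & gold)
        ps.append(tp / len(pred) if pred else 0.0)
        rs.append(tp / len(gold) if gold else 0.0)
    p, r = sum(ps) / len(ps), sum(rs) / len(rs)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Window 1 (no gold, no prediction) contributes P = R = 1 by convention.
p, r, f1 = mie_macro_prf([[], ["a", "b"]], [[], ["a"]])
```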
We utilize Chinese RoBERTa (Cui et al., 2019) as the encoder in our model, i.e., the base version of the pre-trained Chinese RoBERTa-wwm-ext. The ID, IIHGC and IS modules have the same dropout rate, number of attention heads and hidden dimension as the original RoBERTa model: 0.1, 12 and 768, respectively. We employ the AdamW optimizer for training, with a learning rate of 2e-5 for the parameters of the RoBERTa encoder and 2e-4 for the rest of the model. We incorporate an early stopping mechanism to avoid overfitting, whereby training stops if there is no improvement in MIE-Macro F1 or MIE-Micro F1 on the validation set over 10 epochs. Running an experiment with these settings takes roughly 9 hours on a single NVIDIA V100 GPU.

Baseline Models
To demonstrate the effectiveness of our proposed model, we compare it with the following strong baselines:
• RoBERTa CLS: a basic classification model which employs the top-level representation h CLS to perform medical information extraction.
• LightXML (Jiang et al., 2021): LightXML adopts end-to-end training, label clustering and dynamic negative label sampling to improve model performance.
• CorNet (Xun et al., 2020): CorNet adds an extra module that learns label correlations to enhance raw label predictions and output augmented label predictions.
• MIE (Zhang et al., 2020): MIE learns an improved representation using a deep matching architecture that takes dialogue-turn interactions into account to capture category-item pair information and status information, feeding both into a classifier for text classification.
• SAFE (Xia et al., 2022): SAFE is a speaker-aware model that employs a co-attention fusion technique with multitask learning and graph networks.
• MGT (Li et al., 2021b): The MGT model proposes a Multi-Granularity Transformer to fully capture the interaction between role-enhanced cross-turns and integrate mixed-granularity representations.

Results and Discussion
We report the experimental results of all compared algorithms on the MIEACL dataset in Table 2. Our approach outperforms previous work (Zhang et al., 2020) on the DMIE task. The combination of IIHGC, ID and IS achieves 82.54% MIE-Macro F1 and 81.62% Micro F1, outperforming the previous work by 16.49% and 16.24%, respectively. The performance gain is mainly due to the introduction of internal correlations between items and statuses. Our model not only introduces prior information between items, but also uses the self-attention mechanism to effectively update and refine the item representations. Please note that since we need to calculate the Micro metrics in addition to the Macro metrics, for consistency we re-ran the code of Zhang et al. (2020) on the benchmark dataset instead of directly citing the reported numbers. We can also see that our proposed method outperforms conventional Multi-Label Text Classification (MLTC) approaches (Jiang et al., 2021; Xun et al., 2020). This is mainly because the MLTC models consider neither the complex relationships between the conversation contents and the items, nor the corresponding relationships between items and statuses. Our model outperforms these models by more than 11% and 14% on MIE-Macro F1 and Micro F1, respectively.
To confirm that the improvement does not merely come from the adoption of a pre-trained language model, we compare our model with RoBERTa CLS. The performance of RoBERTa CLS shows that a high-quality pre-trained model is helpful for the DMIE task. Nevertheless, our method surpasses this baseline by an additional 7.19% and 8.9% on MIE-Macro F1 and Micro F1, respectively, demonstrating the effectiveness of our IIHGC, ID and IS modules.

Results of Different Model Variants
We conduct an ablation study by gradually stripping components to examine the effectiveness of each component in our full model (IIHGC+ID+IS). The results are shown in Table 3. Variant 1 is the most basic variant, which only uses the dialogue representation and item embeddings for label prediction. Benefiting from the introduction of item information, Variant 1 already improves over the text-classification-based approach (RoBERTa CLS). However, due to the lack of medical item correlation information, the reference relationship between dialogue and items, and the corresponding relationship between items and statuses, Variant 1 performs worst among all variants. Variants 2-4 perform better than Variant 1, which proves the effectiveness of the introduced IIHGC, ID and IS components; a single ID module brings the largest improvement.
In Section 3.3, we use the Graph Attention Network (GAT) introduced in (Velickovic et al., 2018) to model the relations among items. There are also other graph neural network approaches (Corso et al., 2020; Chen et al., 2020; Li et al., 2021a). Here we compare the performance of IIHGC equipped with different GNN models. As shown in Table 4, equipped with any of these four graph models, the proposed method consistently outperforms SOTA works.
We further explore the sensitivity of the model to the random seed, and the results are presented in Figure 3. Here five different random seeds were used. We find that the full model's performance remains stable under different random seeds and is consistently better than the other model variants. To sum up, our proposed method effectively improves performance and is robust in training.

Results of Different Label Frequencies
Now we study the performance on labels with varied frequencies. Here a label means the combination of an item and its specified status. As shown in Figure 4, the label distribution in a real medical environment is naturally long-tailed. Such long-tail bias may affect the performance of the model and its practicability. We divide all the labels into three groups according to their frequencies F in the training set: the major group (Group 1, F > 90), the mid-frequency group (Group 2, 20 < F ≤ 90) and the long-tail group (Group 3, 1 ≤ F ≤ 20). We calculate the accuracy of each label and add it to its corresponding group average. As shown in Table 5, the performance of all methods gradually declines as the label frequency becomes smaller, but the performance of the full model is consistently higher than all variants and decreases at a slower rate. In Group 3, the full model achieves a 28.65% improvement over the baseline variant. We speculate this is due to the incapability of the baseline variants to understand the semantics of labels, which restricts their performance. The full model not only generates fine-grained item representations for different dialogues, but also discovers the implicit semantics of long-tail labels.
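The three frequency bands can be reproduced with a simple counter over the training labels. The band thresholds follow the text; the flat list-of-labels input is an assumed interface:

```python
from collections import Counter

def frequency_groups(train_labels):
    """Split labels into the three frequency bands used in the analysis:
    Group 1 (F > 90), Group 2 (20 < F <= 90), Group 3 (1 <= F <= 20)."""
    groups = {1: [], 2: [], 3: []}
    for label, f in Counter(train_labels).items():
        if f > 90:
            groups[1].append(label)
        elif f > 20:
            groups[2].append(label)
        else:
            groups[3].append(label)
    return groups

# Hypothetical label frequencies spanning all three bands.
groups = frequency_groups(["x"] * 95 + ["y"] * 30 + ["z"] * 5)
```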

Results of Different Dialogue Lengths
Medical conversations vary in length, which is usually determined by the complexity of the patient's disease. The MIEACL dataset adopts a conven-
(The one-shot prompt template used for GPT-3.5 and GPT-4 is shown in Figure 5.)
GPT-3.5-turbo and GPT-4 both underperformed SOTA MIE works and the proposed IIHGC by a large margin. This may be because, for this particular MIE task, supervised models are more suitable than general LLM approaches.

Conclusions
We introduce a heterogeneous item graph to model item correlations, as well as two attention-based modules to learn a dialogue- and status-enriched joint representation for Dialogue Medical Information Extraction (DMIE). Instead of formulating DMIE as an ordinary multi-label text classification problem, we consider the item-status relationship and model it explicitly. Extensive experiments demonstrate that each component of our proposed model brings performance gains over the baseline model, and that their combination further improves the results and achieves state-of-the-art performance on the MIEACL benchmark. We also evaluate the performance of GPT-3.5-turbo and GPT-4. In the future, we will attempt to include the textual names of items and statuses in modeling.

Figure 1 :
Figure 1: An example of the DMIE task, translated from Chinese. Given the whole dialogue, five labels are extracted. Each label consists of a category, an item and its status. Each label and its corresponding evidence in the dialogue are highlighted in the same color.

Figure 2 :
Figure 2: The workflow of the proposed model, which consists of three main components: the Item Interactive Heterogeneous Graph Convolution (IIHGC) module, the Item Updating with Dialogue-Aware Attention (ID) module and the Item Refining with Status-Aware Attention (IS) module.

Figure 4 :
Figure 4: The distribution of medical item+status label frequency on the MIEACL benchmark dataset.

Figure 5 :
Figure 5: The One-Shot Prompt Template of the DMIE Task for GPT-3.5 and GPT-4

Table 1 :
Details of the labels in a DMIE dataset: MIEACL.

Table 2 :
Results (%) on the MIEACL dataset, where P stands for Precision, R stands for Recall, and F1 is the standard F1 metric. IIHGC+ID+IS is our proposed model, which outperforms all baselines. † indicates results taken from the original paper.

Table 3 :
Quantitative analysis (%) of our model with different variants.

Table 4 :
Results (%) on the MIEACL dataset with different GNN models.