Multimodal Graph-based Transformer Framework for Biomedical Relation Extraction

The recent advancement of pre-trained Transformer models has propelled the development of effective text mining models across various biomedical tasks. However, these models are primarily trained on textual data and often lack the domain knowledge about entities needed to capture context beyond the sentence. In this study, we introduce a novel framework that enables the model to learn multi-omics biological information about entities (proteins) with the help of additional multi-modal cues such as molecular structure. Towards this, rather than developing modality-specific architectures, we devise a generalized and optimized graph-based multi-modal learning mechanism that utilizes the Graph-BERT model to encode the textual and molecular structure information and exploits the underlying features of the various modalities to enable end-to-end learning. We evaluated our proposed method on the Protein-Protein Interaction task from the biomedical corpus, where our generalized approach is observed to benefit from the additional domain-specific modality.

This work addresses the task of Protein-Protein Interaction (PPI) extraction, where the relation ('interaction' or 'non-interaction') between two protein mentions is identified from the given biomedical text. Knowledge about protein interactions is critical to understanding biological processes, such as signaling cascades, translation, and metabolism, that are regulated by interactions that alter proteins to modulate their stability (Elangovan et al., 2020).
The majority of existing works on PPI in the literature have focused only on the textual information present in the biomedical article. However, these approaches fail to capture (1) multi-omics biological information regarding protein interactions, and (2) genetic and structural information of the proteins. A few works (Dutta and Saha, 2020; Asada et al., 2018) have been reported in the literature where the researchers have considered different modalities of the biomedical corpus. However, these multi-modal architectures are modality-specific and thus very complex. Hence, there is a pressing need to develop a generalized and optimized model that can understand all the modalities rather than developing separate architectures for different modalities. Towards this, we explore a Graph-based Transformer model (Graph-BERT) to learn modality-independent graph representations. This enables the model to acquire joint knowledge of both modalities (textual and protein structure) under a single learning network. The main contributions of this work are:
1. Besides the textual information of the biomedical corpus, we have also utilized protein atomic structural information while identifying the protein interactions.
2. We developed a generalized, modality-agnostic approach that is able to learn the feature representations of both the textual and the protein-structural modality.
3. Our analysis reveals that the addition of the protein-structure modality increases the efficiency of the model in identifying interacting protein mentions.
Related Work: Existing studies have adopted traditional statistical and graphical methods (Miyao et al., 2008; Chang et al., 2016) to identify protein interactions from textual content. Later, with the success of deep learning, several techniques based on Convolutional Neural Networks (Choi, 2018; Peng and Lu, 2017; Ekbal et al., 2016), Recurrent Neural Networks (Hsieh et al., 2017; Ahmed et al., 2019), Long Short Term Memory networks (Ningthoujam et al., 2019; Yadav et al., 2020), and language models (Yadav et al., 2021) have been proposed for extracting relationships from biomedical literature and clinical records. Fei et al. (2020) proposed a span-graph neural model for jointly extracting overlapping entity relationships from biomedical text. The recent advancement of Transformer models (Beltagy et al., 2019) in the biomedical domain has also led to significant performance improvements in the biomedical relation extraction task (Giles et al., 2020). Recently, the use of multi-modal datasets in the BioNLP domain (Dutta and Saha, 2020; Asada et al., 2018) has drawn the attention of researchers due to its better performance than traditional approaches.
In contrast, our model handles multiple modalities without relying on modality-specific architectures.

Proposed Method
In this section, we introduce our proposed method and its detailed implementation. The proposed deep multi-modal architecture, illustrated in Figure 1, consists of four main components: (1) Multi-modal Graph Constructor, (2) Multi-modal Graph Fusion, (3) Multi-modal Graph Encoder, and (4) PPI Predictor. Below we briefly describe each of the model components.
Problem Statement: Given a biomedical input text S = {w_1, w_2, ..., w_n} having n words, and a pair of protein mentions p_1, p_2 ∈ S, we aim to predict whether the protein mentions 'interact' or 'non-interact'.
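As a concrete illustration of the problem setting, a single PPI instance pairs a sentence with a tagged protein pair and a binary label (the sentence and protein names below are hypothetical, not drawn from the corpora):

```python
# Hypothetical PPI instance: S = {w_1, ..., w_n} with p_1, p_2 in S.
sentence = "PROT1 binds directly to PROT2 in the nucleus"
words = sentence.split()                 # the word sequence S
protein_pair = ("PROT1", "PROT2")        # the mention pair (p_1, p_2)

# A valid instance requires both mentions to occur in the sentence.
assert all(p in words for p in protein_pair)

label = "interaction"                    # target: 'interaction' / 'non-interaction'
```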

Multi-modal Graph Constructor
This component consists of two distinct graph constructors for the two modalities: the Textual Graph Constructor and the Protein Structure Graph Constructor. The former constructs a graph from the textual content, aiming to capture the lexical and contextual information present in the input. The latter exploits the atomic structure (3D PDB structure) of the protein molecules to build the graph.
Textual Graph Constructor: To generate the textual graph, we begin by constructing the vocabulary from the training corpus. For each input text S, we use a one-hot-encoding mechanism to encode it as a vector representation R_S ∈ R^{|V|}. However, the representation R_S suffers from data sparsity, as the vocabulary size can become very large for the entire training corpus. To deal with this, we utilized Principal Component Analysis (PCA) (Wold et al., 1987) to reduce the vector dimensionality, yielding a reduced representation R̂_S. The textual graph G_T = (V_T, E_T) takes the reduced sentence representations as its nodes, V_T = {R̂_{S_1}, R̂_{S_2}, ..., R̂_{S_|N|}}. The link e_{i,j} between nodes R̂_{S_i} and R̂_{S_j} is determined by the common entities (proteins) present in both sentences S_i and S_j; if there is no common entity, then no link exists between the nodes. The edges E_T = {e_{i,j} | i, j ∈ V_T, and a protein mention is shared by i and j} are the set of all links that exist between any two nodes in the graph G_T.
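The textual graph construction described above can be sketched as follows. This is a minimal, self-contained illustration on a toy corpus (the sentences and protein names are invented), with PCA implemented via an SVD of the centered data rather than a library call:

```python
import numpy as np

# Toy corpus: each sentence paired with the protein mentions it contains
# (hypothetical data; the paper uses the BioInfer / HPRD50 training sets).
sentences = [
    ("PROT1 activates PROT2", {"PROT1", "PROT2"}),
    ("PROT2 binds PROT3",     {"PROT2", "PROT3"}),
    ("PROT4 is unrelated",    {"PROT4"}),
]

# Build the vocabulary and the sparse multi-hot sentence vectors R_S in R^{|V|}.
vocab = sorted({w for s, _ in sentences for w in s.split()})
R = np.array([[1.0 if w in s.split() else 0.0 for w in vocab]
              for s, _ in sentences])

# PCA via SVD of the centered matrix: project onto the top-2 components.
R_centered = R - R.mean(axis=0)
_, _, Vt = np.linalg.svd(R_centered, full_matrices=False)
R_hat = R_centered @ Vt[:2].T            # reduced node features, one row per node

# An edge e_ij exists iff sentences i and j share at least one protein mention.
edges = {(i, j)
         for i in range(len(sentences))
         for j in range(i + 1, len(sentences))
         if sentences[i][1] & sentences[j][1]}
```

Only sentences 0 and 1 share a protein (PROT2), so the toy graph has a single edge.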
Protein Structure Graph Constructor: For the protein structural modality, we created a graph where each node represents an atom and each edge represents a connection between atoms. To obtain the atomic information about the proteins, we first mapped the proteins to genes and queried the PDB (Protein Data Bank) 1 for each associated protein mention. Each protein entry obtained from the PDB consists of a set of atoms {a_1, a_2, ..., a_A} and a node feature matrix N_p ∈ R^{A×d_p}. The node feature matrix for each protein k undergoes a convolutional operation CNN(.) followed by a max-pooling operation pool(.). Formally, P_k = pool(relu(CNN(N_{p_k}))). The final protein representation P_{S_i} for the two proteins present in a given input sentence S_i is computed as P_{S_i} = P_1 ⊕ P_2. Following this, the protein structure graph G_P is constructed with nodes V_P = {P_{S_1}, P_{S_2}, ..., P_{S_|N|}}, where |N| is the number of input sentences in the training dataset and P_{S_i} ∈ R^{d_s} is the protein structure representation of size d_s for sentence S_i.
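The P_k = pool(relu(CNN(N_p))) step can be sketched in plain NumPy. This is an illustrative, untrained stand-in (random atom features and kernel, small hypothetical dimensions), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_protein(node_feats, kernel):
    """pool(relu(CNN(N_p))): 1-D convolution over the atom axis of the
    (A x d_p) feature matrix, ReLU, then max-pooling over atom positions."""
    window = kernel.shape[0]
    A = node_feats.shape[0]
    conv = np.stack([                    # valid convolution over atoms
        np.tensordot(node_feats[t:t + window], kernel, axes=([0, 1], [0, 1]))
        for t in range(A - window + 1)])
    return np.maximum(conv, 0.0).max(axis=0)   # relu + max-pool -> (d_s,)

d_p, d_s, window = 8, 4, 3               # illustrative sizes
kernel = rng.standard_normal((window, d_p, d_s))

# Two proteins from one sentence, with different atom counts A.
P1 = encode_protein(rng.standard_normal((20, d_p)), kernel)
P2 = encode_protein(rng.standard_normal((35, d_p)), kernel)

# Sentence-level structure representation: P_Si = P1 (+) P2 (concatenation).
P_S = np.concatenate([P1, P2])
```

Note that max-pooling over the atom axis makes the output length independent of the atom count A, which is what allows proteins of different sizes to share one node-feature dimensionality.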

Multi-modal Graph Fusion
In this component, we fuse the textual graph G_T and the protein structure graph G_P with the aim of generating a joint representation capable of capturing contextual, lexical, and multi-omics information. Towards this, we expand the node information of the textual graph with the node information obtained from the protein-structure graph. Specifically, we create a multi-modal graph G = (V, E) whose nodes V carry the concatenated vector representations of the respective nodes of the textual and protein structure graphs. Formally, the representation of node v_i ∈ V is R̂_{S_i} ⊕ P_{S_i}. The link information remains intact in the multi-modal graph fusion; thus, E = E_T.
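The fusion step reduces to a per-node concatenation, with edges copied over unchanged. A minimal sketch (random stand-in features; the 1000/1185 dimensions follow the HPRD50 setup reported later):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sentences, d_text, d_struct = 5, 1000, 1185

R_hat = rng.standard_normal((n_sentences, d_text))    # textual node features
P_S   = rng.standard_normal((n_sentences, d_struct))  # structure node features

# Fused multi-modal node features: per-node concatenation R̂_Si (+) P_Si.
V = np.concatenate([R_hat, P_S], axis=1)

# Link information carries over from the textual graph: E = E_T.
E_T = {(0, 1), (1, 2)}
E = set(E_T)
```

The fused node dimensionality is d_text + d_struct (here 2185), matching the feature size reported in the experimental setup.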

Multi-modal Graph Encoder
The majority of existing works on multi-modal relation extraction have treated multiple modalities separately and exploited modality-specific architectures to learn the respective feature representations. However, these strategies inhibit the learning of the inherent shared complementary features that are often present across modalities.
To address this, we present an end-to-end multi-modality learning mechanism that exploits the single expanded multi-modal graph (obtained from the Multi-modal Graph Fusion component) with a Graph-based Transformer encoder. Specifically, we utilized the Graph-BERT encoder over other dominant graph neural networks (GNNs) primarily due to its capability to avoid (a) the suspended animation problem (Zhang and Meng, 2019), and (b) the over-smoothing problem (Li et al., 2018), both of which hinder the application of GNNs to deep graph representation learning tasks. For a given multi-modal graph G = (V, E) with the set of nodes V and edges E, Graph-BERT samples a set of subgraph batches for all the nodes, G = {g_1, g_2, ..., g_|V|}. For each node v_j in subgraph g_i, Graph-BERT computes a raw feature vector embedding e_j^x, a role embedding e_j^r, a position embedding e_j^p, and a distance embedding e_j^d. The initial input vector for node v_j is computed as follows: h_j^(0) = e_j^x + e_j^r + e_j^p + e_j^d. Furthermore, the initial input vectors for all the nodes in g_i can be organized into a matrix H^(0) = [h_i^(0), h_{i,1}^(0), ..., h_{i,k}^(0)], where k is a hyper-parameter (the subgraph size). The Graph-Transformer computes the vector representations through D layers of transformers, H^(l) = G-Transformer(H^(l-1)) for l ∈ {1, ..., D}. The final feature z_i for node v_i is obtained by fusing the last-layer representations: z_i = Fusion(H^(D)).
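The construction of the initial input matrix H^(0) can be sketched as follows, with the subgraph size k = 5 and hidden size 32 used in the experiments, and random stand-ins for the four learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
k, d_h = 5, 32   # subgraph size and hidden size from the experimental setup

# Per-node embeddings for one sampled subgraph g_i (random stand-ins here):
e_x = rng.standard_normal((k, d_h))   # raw feature embedding
e_r = rng.standard_normal((k, d_h))   # role embedding
e_p = rng.standard_normal((k, d_h))   # position embedding
e_d = rng.standard_normal((k, d_h))   # distance embedding

# h_j^(0) = e_j^x + e_j^r + e_j^p + e_j^d for every node, stacked into H^(0),
# which is then passed through the D transformer layers.
H0 = e_x + e_r + e_p + e_d
```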

PPI Predictor
The final feature z_i of each node i is used to predict the PPI category. Towards this, we employed a feed-forward network with a softmax activation layer to classify the input text into one of the two classes: ŷ = softmax(W z_i + b), where W and b are the weight matrix and bias vector, respectively, and K denotes the total number of distinct classes, which are 'interaction' and 'non-interaction' in our case.
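The predictor is a single affine layer followed by a softmax. A minimal sketch with random untrained weights (the dimensions match the encoder's hidden size; the weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, K = 32, 2   # hidden size; K = 2 classes ('interaction', 'non-interaction')

z = rng.standard_normal(d_h)          # final node feature z_i from the encoder
W = rng.standard_normal((K, d_h))     # weight matrix
b = np.zeros(K)                       # bias vector

logits = W @ z + b
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()

pred = ("interaction", "non-interaction")[int(probs.argmax())]
```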

Datasets and Experimental Analysis
Datasets: In this work, we have used the two exemplified multi-modal protein-protein interaction datasets of Dutta and Saha (2020). In these datasets, the authors exemplified two popular benchmark PPI corpora, namely BioInfer 2 and HPRD50 3 .

Experimental Setup
We have utilized the pretrained Graph-BERT 4 in our experiments. The initial vocabulary sizes for the BioInfer and HPRD50 datasets are 6561 and 1277, respectively, which we projected into 1000- and 1185-dimensional vectors using PCA. We have kept a maximum of 5052 and 1185 words in the two datasets, respectively. The filter sizes of the CNN are set to 3 and 4. We obtained 1185-length node feature representations for the protein structure graph, so the nodes of the multi-modal graph receive 2185-sized feature representations. For Graph-BERT training, we obtained 2500 nodes and 25859 edges from the HPRD50 dataset, and 13675 nodes and 15930214 edges from the BioInfer dataset. We have used the hyper-parameters of the Graph-BERT model in our proposed model, with the following values: subgraph size = 5, hidden size = 32, attention head number = 2, Transformer layers D = 2, learning rate = 0.01, weight decay = 5e-4, hidden dropout rate = 0.5, attention dropout rate = 0.3, loss = cross entropy, optimizer = Adam (Kingma and Ba, 2014). The hyper-parameters were chosen based on 5-fold cross-validation experiments on both datasets.

Table 3: Comparative analysis of the proposed multi-modal approach with other state-of-the-art approaches for the HPRD50 dataset.
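For reference, the reported hyper-parameter values can be collected into a single configuration dict (the key names are illustrative; Graph-BERT's own config keys may differ):

```python
# Hyper-parameter settings reported in the experimental setup.
CONFIG = {
    "subgraph_size_k": 5,
    "hidden_size": 32,
    "attention_heads": 2,
    "transformer_layers_D": 2,
    "learning_rate": 0.01,
    "weight_decay": 5e-4,
    "hidden_dropout": 0.5,
    "attention_dropout": 0.3,
    "loss": "cross_entropy",
    "optimizer": "adam",
}
```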

Results and Analysis:
We compared the performance (cf. Tables 1 and 3) of our proposed model with existing state-of-the-art methods on PPI for both datasets. These existing methods are based on different techniques, such as kernel-based methods (Choi and Myaeng, 2010; Tikk et al., 2010; Qian and Zhou, 2012; Li et al., 2015), deep neural network-based methods (Zhao et al., 2016), a multichannel dependency-based convolutional neural network model (Peng and Lu, 2017), semantic feature embedding (Choi, 2018), shortest dependency path (Hua and Quan, 2016), and a recent deep multi-modal approach (Dutta and Saha, 2020). It is to be noted that our results on BioInfer and HPRD50 are not directly comparable with the existing approaches, as the other methods have utilized different test sets for evaluation. From the above comparative study, it is evident that our proposed multi-modal approach identifies protein interactions efficiently and can be further improved in several ways.
Discussion: To analyze the role of each modality, we conducted an ablation study, as shown in Table 2. We performed experiments with the textual modality alone. We could not consider the protein-structural modality alone, as it would introduce conflicting relation labels: two sentences may contain the same pair of proteins, yet the proteins can have conflicting relations (interacting or non-interacting) depending on the context of the sentences in which they appear. Although the structural modality cannot draw any conclusion alone, the integration of both modalities demonstrates improvements (3.78% and 1.8% in terms of F-score for HPRD50 and BioInfer, respectively) over the textual modality alone.

Error Analysis
The comparative confusion matrices with only the textual modality and with multi-modality for both datasets are shown in Figure 2. We performed an error analysis to postulate possible reasons for errors and areas with scope for improvement. After a careful study of the false positive and false negative classes, the following observations can be made. 1) Instances with a large number of protein mentions in a single sentence can cause misclassification. For example, the maximum number of proteins in any instance of the BioInfer and HPRD50 datasets is 26 and 24, respectively. Such a large number of proteins in a single instance may lead the network to misclassification. 2) A few samples contain repeated mentions of the same protein. This adds noise and might lead to the loss of useful contextual information.
3) To obtain a consistent graph from the molecular structure, the node vectors were required to be of the same length. This is achieved by padding the vectors with zeros, and when a PDB entry is not available, a null vector is used for consistency. Better handling of such missing data would help in learning the proposed model.
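The padding and null-vector scheme described in point 3 can be sketched as a small helper (the function name and signature are illustrative, not from the authors' code):

```python
import numpy as np

def pad_or_null(vec, target_len):
    """Zero-pad a structure vector to a fixed length; return an all-zero null
    vector when the PDB entry is missing (vec is None), as described above."""
    out = np.zeros(target_len)
    if vec is not None:                  # missing PDB entry -> null vector
        n = min(len(vec), target_len)
        out[:n] = vec[:n]
    return out

padded = pad_or_null(np.array([1.0, 2.0, 3.0]), 5)   # zero-padded to length 5
null_vec = pad_or_null(None, 5)                      # PDB unavailable
```

Since the null vector is indistinguishable from a genuine all-zero feature, a learned missing-data embedding or imputation would be a natural refinement.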

Conclusion
This work presents a novel modality-agnostic graph-based framework to identify interactions between proteins. Specifically, we explored two modalities, textual and molecular structure, that enable the model to learn domain-specific multi-omics information complementary to the task-specific contextual information. Detailed comparative results and analysis show that our proposed multi-modal approach can capture the underlying molecular structure information without relying on sophisticated modality-specific architectures. Future work aims at extending this study to other related tasks, such as drug-drug interactions.