Entity Enhancement for Implicit Discourse Relation Classification in the Biomedical Domain

Implicit discourse relation classification is a challenging task, in particular when the text domain differs from that of the standard Penn Discourse Treebank (PDTB; Prasad et al., 2008) training corpus (Wall Street Journal in the 1990s). We tackle implicit discourse relation classification in the biomedical domain, for which the Biomedical Discourse Relation Bank (BioDRB; Prasad et al., 2011) is available. We show that entity information can be used to improve the representation of discourse relational arguments. In a first step, we show that explicitly marked instances that are content-wise similar to the target relations can be used to achieve good performance in the cross-domain setting using a simple unsupervised voting pipeline. As a further step, we show that, with the linked entity information from the first step, a transformer augmented with entity-related information (K-BERT; Liu et al., 2020) sets a new state of the art on the dataset, outperforming the large pre-trained BioBERT model (Lee et al., 2020) by 2 percentage points.


Introduction
Discourse relation classification (DRC) involves automatically inferring the logical link between different text segments (such as causal, contrastive, temporal etc.). It has been shown to be a valuable preprocessing step for many downstream natural language processing tasks such as machine translation (Guzmán et al., 2014; Meyer et al., 2015), text summarization (Gerani et al., 2014) and question answering (Jansen et al., 2014). A main obstacle to wider use of automatic DR classifiers, however, lies in getting them to work reliably on domains other than the WSJ text underlying the PDTB (Prasad et al., 2008) and RST (Carlson et al., 2003) corpora, on which discourse relation parsers are usually trained.
Moving to a different domain is particularly challenging in DRC because the overall distribution of relations typically differs between domains, and because many of the content words that classifiers may rely on are very different between domains. We here focus on the most challenging subtask of implicit discourse relation classification, which involves classifying those relations that are not linked by any explicit connectives like "because" or "but". In order to correctly recognize implicit relations, the classifier needs to recognize subtle surface cues (which may differ between domains) and learn about typical content-related relations. For instance, from the example "it's hot outside, therefore I'd like to eat an icecream", the words "hot outside" and "icecream" are relevant cues for the relation. An overview of typical cues for determining a coherence relation is provided in Das and Taboada (2018).
The key to improving automatic DRC on a new domain hence consists of better encoding of the discourse relational arguments. As we will show below (in line with earlier findings by Shi and Demberg, 2019b), it makes a big difference to have at least a small amount of in-domain discourse annotated data.
We explore DRC in the biomedical domain, which is particularly suitable because a discourse-annotated corpus is available (BioDRB; Prasad et al., 2011). We can use it for evaluation and also to study a setting with a small amount of in-domain training data. Furthermore, large raw text corpora are available for the biomedical domain. An example instance from BioDRB (Prasad et al., 2011) is shown below:
1. [These abnormalities in active RA are thought to be induced mainly after chronic exposure to high concentrations of IL-6.] Arg1 (Implicit=thus) [The limited efficacy of IL-10 treatment of RA patients may be explained in part by the unresponsiveness to IL-10 of inflammatory cells, including T cells.] Arg2 - Implicit, Contingency.Cause
Scientific texts such as those from the biomedical domain are well known to express much of their content in nominal phrases, and less in verb phrases (Halliday, 2006). Concretely, for the above example, understanding how RA (rheumatoid arthritis) and inflammatory cells (including T cells) are connected is important for identifying the discourse relation correctly. The high importance of entities in these texts is a crucial insight on which we base our approach.
In this paper, we first propose an unsupervised method that uses information retrieval and knowledge graph techniques to identify text passages that are content-wise similar to the coherence relation we want to label. The underlying assumption is that if two instances share the same entities in both relational arguments, they may have the same or a similar discourse relation. This part of the method is applicable to any domain for which large amounts of in-domain text are available, but no in-domain discourse relation annotations. We find that this method improves results substantially compared to a Bi-LSTM baseline model, but does not reach state-of-the-art performance (which is set by transformer models).
We therefore proceed to enrich a transformer model with the knowledge extracted from the unlabelled texts, using the K-BERT model (Liu et al., 2020). The model is fine-tuned on the discourse-annotated in-domain BioDRB data. We show that this setting sets a new state of the art on discourse relation classification in the biomedical domain, achieving an accuracy of 69.57%.

Related Work
Early approaches on BioDRB use probabilistic classifiers such as Naïve Bayes, Maximum Entropy, etc. to predict the relation (Xu et al., 2012). Bai and Zhao (2018) combine representations from different types of embeddings including contextualized word vectors from ELMo (Peters et al., 2018) and achieve 55.9% accuracy on BioDRB for in-domain training, and 29.52% in the cross-domain setting (reported in Shi and Demberg (2019b)). Shi and Demberg (2019b) also explore the performance of BERT (Devlin et al., 2019) models on the DRC task on BioDRB using cross-domain (fine-tuning on PDTB, testing on BioDRB) as well as in-domain (fine-tuning on BioDRB and testing on BioDRB) settings. They find a very good performance of the BERT model, which they attribute to its "next sentence prediction" task in pre-training.
Comparing the original BERT model to BioBERT (Lee et al., 2020), which was trained on biomedical text, they find, however, that BioBERT has only a limited ability to learn domain-specific representations: cross-domain performance is no better than for the BERT model, and in-domain improvements are moderate at only 1.5 percentage points. Given that entities play an important role in inferring implicit discourse relations in scientific texts, putting an emphasis on entities seems vital for achieving further improvements.
In contrast with previous studies that (largely unsuccessfully) attempted to train on explicit discourse relations in order to classify implicit relations in a supervised way, such as Marcu and Echihabi (2002) and Sporleder and Lascarides (2008), among others, we propose an unsupervised voting pipeline that achieves good performance even in comparison with supervised models like BERT and BioBERT. We believe that the key difference is that previous methods tried to learn surface cues from explicit relations and apply them to implicit ones (which does not work, because these features differ between explicits and implicits, see e.g., Sporleder and Lascarides (2008); Asr and Demberg (2012)), while our method focuses on the content of the discourse relational arguments.

Unsupervised Method with Information Retrieval System
The successful use of a memory network in Shi and Demberg (2019a) showed that instances that share the same relation have close representations. We believe that for sparse data like BioDRB, which has only around 2,000 labeled implicit instances in total, it is essential to use similar explicit instances to help find the latent patterns they share. In this section, we introduce an unsupervised method for implicit DRC, which is inspired by a recent information retrieval method. The core idea is as follows: we use information retrieval methods to identify explicitly marked coherence relations in the corpus that are content-wise similar to the relation we want to label. We then automatically label these explicitly marked instances (relying on the high DRC accuracy of about 96% for explicit relations) and assign the majority label of the explicit instances to the implicit instance from our test set.
3.1 Retrieval of similar instances from a large corpus
Figure 1 illustrates the overall pipeline of the proposed method. First, each instance from BioDRB (Prasad et al., 2011) is treated as a query against the PubMed and PMC databases.
PubMed and PMC are free full-text archives of biomedical and life sciences journal literature at the NIH National Library of Medicine. The database we use here is a corpus created from a subset of the whole PubMed and PMC collections, consisting of 7,079 documents in total (1,376 from PubMed and 5,703 from PMC).
Given the query and the candidate documents, we use TF-IDF to retrieve the 10 most relevant documents. The retrieved documents are then fed into a discourse parser; we use the PDTB-style end-to-end parser by Lin et al. (2014). The output of the parser contains the two arguments, the explicit discourse connective and a discourse relation label.
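The retrieval step can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the whitespace tokenizer, the smoothed IDF formula, cosine scoring and the function names (`tfidf_vectors`, `top_k_documents`) are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption; a real system would do more)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (assumed variant) so that no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k_documents(query, docs, k=10):
    """Return the indices of the k documents most similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection cannot match anything and are dropped.
    q_tf = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_tf.items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    # Sort by descending score; ties are broken by document index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]
```

In the pipeline described above, the query would be the text of a BioDRB instance and `docs` the 7,079 PubMed and PMC documents, with `k=10`.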
The Quasi Knowledge Graphs System, proposed by Lu et al. (2019), is designed to answer complex questions. It is a novel method that computes answers by dynamically building up a knowledge graph that fits the query. It consists of several steps including the extraction of subject-predicate-object (SPO) triples, knowledge graph construction, and a graph algorithm. We here only use the first step from this pipeline, extracting SPO triples, and actually only use the subject and object, not the predicate, to match with the noun phrases in the query. For example, from the relation instance in Example 1 above, the system would extract SPO triples (NETosis, enhanced in, RA) and (autoantibodies, known risk factors for, RA), from which we further employ only NETosis, RA; autoantibodies, RA.
After extracting the SPO triples from all the explicit discourse instances, we employ two types of matching strategies to connect them with the query: (i) Hard matching: if the subject or object appears in the query, we count the instance as a vote. (ii) Soft matching: we find that with hard matching, many positive samples are filtered out and very few explicit instances are identified. We therefore use the cosine similarity between the subject or object and the noun phrases in the query to detect similar entities. Cosine similarities are computed on the BioBERT encodings of the entities. We define a threshold that decides whether an explicit instance is similar enough to count as a valid vote. The threshold is treated as a hyperparameter and tuned on the validation set. This method for detecting similar explicit instances is also used in our second approach described in Section 4. With the steps described above, each query is eventually connected to a number of similar explicit instances, and the prediction for the query is the majority vote over their explicit discourse sense labels.
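The two matching strategies and the majority vote can be sketched as below. This is a hedged illustration, not the original code: the `embed` callable stands in for the BioBERT entity encoder, hard matching is simplified to a case-insensitive comparison against the query noun phrases, and all function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hard_match(entities, query_phrases):
    """Hard matching (simplified): subject or object equals a query noun phrase."""
    phrases = {p.lower() for p in query_phrases}
    return any(e.lower() in phrases for e in entities)


def soft_match(entities, query_phrases, embed, threshold):
    """Soft matching: some entity is close enough to some query noun phrase.

    `embed` is an assumed callable mapping a string to a vector; in the
    paper this role is played by BioBERT encodings.
    """
    return any(
        cosine(embed(e), embed(p)) >= threshold
        for e in entities
        for p in query_phrases
    )


def majority_vote(query_phrases, explicit_instances, matcher):
    """Predict the label carried by most of the matching explicit instances.

    explicit_instances: iterable of (subject, object, relation_label).
    matcher: callable (entities, query_phrases) -> bool.
    Returns None if no explicit instance matches the query.
    """
    votes = Counter(
        label
        for subj, obj, label in explicit_instances
        if matcher((subj, obj), query_phrases)
    )
    return votes.most_common(1)[0][0] if votes else None
```

With soft matching, the matcher can be bound to an encoder and a threshold, for example `lambda ents, qp: soft_match(ents, qp, embed, 0.9)`, where the value 0.9 is purely illustrative.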

Experiments and results
On average, 813.99 explicit instances are extracted for each query. With hard matching, 7.91 similar entities are matched with the subject or the object in the query. For soft matching, we randomly choose 10% of all instances as a validation set, which is used to set the threshold for the cosine similarity score.
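One way to set the threshold on a held-out 10% split is a simple grid search, sketched below. The candidate grid, the random seed and the `predict` callable are assumptions for illustration; the text does not specify how the threshold search was carried out.

```python
import random


def split_validation(instances, fraction=0.1, seed=0):
    """Randomly hold out a fraction of the instances as a validation set."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * fraction))
    return shuffled[:n_val], shuffled[n_val:]


def tune_threshold(validation, predict, candidates):
    """Pick the candidate threshold with the highest validation accuracy.

    validation: list of (instance, gold_label) pairs.
    predict: assumed callable (instance, threshold) -> predicted label,
             e.g. the soft-matching voting pipeline.
    On ties, the first candidate in `candidates` wins.
    """
    def accuracy(threshold):
        correct = sum(predict(x, threshold) == y for x, y in validation)
        return correct / len(validation)

    return max(candidates, key=accuracy)
```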
The experimental results are shown in Table 1. We compare the results with related work by Bai and Zhao (2018) as well as several models reported in Shi and Demberg (2019b).
Our proposed unsupervised method achieves an accuracy of 35.29% with hard matching and 41.95% with soft matching. These results outperform other non-transformer approaches by a large margin. Comparing the hard and soft matching variants, our results show that identifying instances with similar entities leads to a larger set of relevant documents, which in turn helps to increase the robustness of the majority vote. The table also shows that the approach almost reaches the performance of recent very strong transformer models: the BERT model achieves 44.79% accuracy in the cross-domain setting (Shi and Demberg, 2019b).
The approach proposed here could be further refined by using better argument representations than simple matching of subject and object entities, by learning the classification decisions instead of using simple majority voting, and by moving to transformer architectures. Our second approach addresses these points by employing a transformer architecture that can take the SPO triple information into account to encode the relational arguments more richly.

DRC with an entity-augmented transformer
Kishimoto et al. (2018) found that integrating external domain-specific knowledge into the model is beneficial for this task: they integrated ConceptNet relations as additional knowledge into an LSTM network and achieved better performance on the PDTB. We aim to explore whether model performance can be further improved by exploiting richer entity representations in specialized texts such as those from the biomedical domain. The pipeline with soft matching proposed in the above section provides us with SPO triples from related documents for each implicit relation instance in the test set. We employ the recently proposed Knowledge-enabled Language Representation model (K-BERT; Liu et al., 2020) to integrate the external entity knowledge into the pre-trained language model for better argument representations.

K-BERT
Due to the domain gap between pre-training and fine-tuning, unsupervised language models (such as BERT) do not perform well on knowledge-driven tasks (Liu et al., 2020). Integrating domain-specific knowledge into a pre-trained model can alleviate this problem. However, the process of knowledge acquisition can be inefficient and expensive.
In order to tackle the heterogeneous embedding space and knowledge noise problems, Liu et al. (2020) proposed a Knowledge-enabled Bidirectional Encoder Representation from Transformers (K-BERT), as illustrated in Figure 2. Using the knowledge layer and the external knowledge graph, the input sentence is expanded into a sentence tree, which is then fed into the embedding layer and the "seeing" layer. The seeing layer controls when the model has access to the original sentence and when it has access to the additional information.
However, knowledge graphs are not available for all domains. We therefore replace information from the knowledge graph with the SPO triples extracted from related raw texts. Compared to a general knowledge graph, our extracted SPO triples are more closely tied to discourse relations, since they are extracted from the explicit instances and are specifically selected to be on-topic. For each input sentence, we attach the top 2 (the default number in K-BERT) similar SPO triples to the entities and convert the result into a sentence tree. We train K-BERT on BioDRB as a classification task. The input sequence for Example 1 is shown below, where the words in italics are the linked entities.
2. These abnormalities in active NETosis enhanced in autoantibodies known risk factors for RA result in Neutrophil Chemotaxis are thought to be induced mainly after chronic exposure to high concentrations of IL-6. The limited efficacy of IL-10 treatment of RA patients reduced complement activation may be explained in part by the unresponsiveness to IL-10 of inflammatory cells, including T cells isolated from CTCL patient.
The whole sentence tree is flattened into a sequence with position indices. The visible matrix is generated to preserve the interactions among tokens within the original sentence and among tokens inside each knowledge graph triple. It constrains the self-attention layers of the transformer so that the tokens of an attached triple interact only with the entity they are attached to, and not with the rest of the sentence.
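The flattening and the visible matrix can be sketched as follows. This is a simplified illustration of the mechanism described by Liu et al. (2020), not the K-BERT source code: the function names, the data layout, and the choice that two branches attached to the same entity cannot see each other are assumptions made for the example.

```python
def build_sentence_tree(tokens, branches):
    """Flatten a sentence plus attached triple branches into one sequence.

    tokens:   list of sentence tokens (the trunk).
    branches: dict mapping a trunk token index to a list of branches,
              each branch being a list of tokens (relation and object).
    Returns (flat_tokens, soft_positions, owner), where owner[i] is None
    for trunk tokens and (anchor_flat_index, branch_id) for branch tokens.
    """
    flat, soft_pos, owner = [], [], []
    for i, tok in enumerate(tokens):
        anchor = len(flat)
        flat.append(tok)
        soft_pos.append(i)  # trunk tokens keep their original position
        owner.append(None)
        for b_id, branch_tokens in enumerate(branches.get(i, [])):
            for offset, bt in enumerate(branch_tokens, start=1):
                flat.append(bt)
                # Branch positions continue from the entity they hang on.
                soft_pos.append(i + offset)
                owner.append((anchor, b_id))
    return flat, soft_pos, owner


def visible_matrix(owner):
    """Build the 0/1 visible matrix for the flattened sequence.

    Trunk tokens see each other; branch tokens see their own branch and
    the entity token the branch is attached to (and vice versa).
    """
    n = len(owner)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            oi, oj = owner[i], owner[j]
            if oi is None and oj is None:
                visible = True            # both in the original sentence
            elif oi is not None and oj is not None:
                visible = oi == oj        # same branch of the same entity
            elif oi is None:
                visible = oj[0] == i      # i is the entity that j hangs on
            else:
                visible = oi[0] == j      # j is the entity that i hangs on
            matrix[i][j] = int(visible)
    return matrix
```

For a toy trunk `["RA", "patients", "improve"]` with one branch `["result", "in", "chemotaxis"]` attached to "RA", the branch tokens are visible to "RA" and to each other, but not to "patients" or "improve".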

Experiments and Results
The experimental results are illustrated in Table 2.
We compare the results with the previous state of the art on the BioDRB dataset (Shi and Demberg, 2019b). K-BERT, which is initialized with the original BERT parameters, achieves 69.57% accuracy. It outperforms BERT without entity augmentation by 6.5 percentage points, and BioBERT, which underwent extensive continued pre-training on in-domain text, by around 2 percentage points.
In addition, we tried removing the relevant entities. The model then performed similarly to the basic BERT, which is consistent with the results reported in Liu et al. (2020). These results confirm that adding related entities improves argument encoding and helps improve performance on the DRC task.

Conclusion
In this paper, we address the task of implicit discourse relation classification on BioDRB in the biomedical domain. Given the importance of entities in scientific text, we address this problem by identifying explicitly marked relations containing the same entities and applying a simple majority voting system. While this approach shows good performance in the unsupervised setting, much better results are achieved when at least a small amount of labelled data is available. We show that when a transformer model is augmented with entity information from the domain, the previous state of the art on the task is exceeded by 2 percentage points.