The impact of near domain transfer on biomedical named entity recognition

Current research in fully supervised biomedical named entity recognition (bioNER) is often conducted in a setting of low sample sizes. Whilst experimental results show strong performance in-domain, it has been recognised that quality suffers when models are applied to heterogeneous text collections. However, the causal factors have until now been uncertain. In this paper we describe a controlled experiment into near domain bias for two Medline corpora on hereditary diseases. Five strategies are employed for mitigating the impact of near domain transference: simple transference, pooling, stacking, class re-labeling and feature augmentation. We measure their effect on f-score performance against an in-domain baseline. Stacking and feature augmentation mitigate f-score loss but do not necessarily result in superior performance except for selected classes. Simple pooling of data across domains failed to exploit size effects for most classes. We conclude that we can expect lower performance and higher annotation costs if we do not adequately compensate for the distributional dissimilarities of domains during learning.


Introduction
Model and feature selection are important experimental tasks in supervised machine learning for suggesting approaches that will generalise well on real world data. Research in biomedical named entity recognition (bioNER) often displays two features: (1) small samples of labeled data, and (2) an implicit assumption that the future data will be drawn from a similar distribution to the labeled data and hence that minimising expected prediction error on held out data will minimise actual future loss. Since expert labeling is time consuming and expensive, labeled data sets tend to be relatively small, e.g. (Kim et al., 2003; Tanabe et al., 2005; Pyysalo et al., 2007), in the region of a few hundred or thousand Medline abstracts. Despite the danger of intrinsic idiosyncrasies, such corpora are often used to demonstrate putative prediction error across the heterogeneous collection of 22 million Medline abstracts. Once this assumption is made explicit it is of interest to both researchers and users that the implications and limitations of such experimental settings are explored.
Cross domain studies have indicated an advantage for mechanisms that compensate for domain bias. For fully supervised learning, which is the scenario we explore here, recent methods include: feature augmentation (Daumé III, 2007;Arnold et al., 2008;McClosky et al., 2010), instance weighting (Jiang and Zhai, 2007;Foster et al., 2010), schema harmonisation (Wang et al., 2010) and semi-supervised/lightly supervised approaches (Sagae and Tsujii, 2007;Liu et al., 2011;Pan et al., 2013). More generally there is a wide body of work in transfer learning (also known as domain adaptation) that tries to handle discrepancies between training and testing distributions (Pan and Yang, 2010).
As an illustration of near domain bias consider the list of high frequency named entities in Table 1 drawn from two sub-domains in the research literature of hereditary diseases. A domain expert in hereditary diseases would have no difficulty in dividing them into two non-overlapping sets corresponding to the two near domains, with one term, t5 (patients), shared by both: {t1, t6, t8, t9} and {t2, t3, t4, t7, t10}.
Previous studies have shown what happens when you radically change the domain and/or the
Table 1: High frequency named entities drawn from two sub-domains of the hereditary disease literature.
t1  rheumatoid arthritis     t6   human leukocyte antigen
t2  lupus erythematosus      t7   coronary heart disease
t3  leopard syndrome         t8   type 1 diabetes
t4  Omapatrilat              t9   T1D
t5  patients                 t10  hypertension
1. We compare four data combination strategies for mitigating the impact of near domain transference and measure their effect on f-score performance against an in-domain baseline.
2. We provide additional evidence for the effectiveness of (Daumé III, 2007)'s frustratingly simple strategy which provides both general and domain-specific features; in effect a joint learning model.
3. Expectedly, but not trivially, we show that a general loss of f-score occurs on bioNER when transferring to near domains. This loss is not uniform across all classes. We provide a class-by-class drill down analysis of the underlying causal factors which make some entities more robust to near domain transference in biomedicine than others.
4. Our results challenge the notion that pooling small corpora, even when guideline differences are reconciled, leads to improved f-score performance (Wang et al., 2010;Wagholikar et al., 2013).
5. In addition to the usual biomedical entity types we introduce the class of phenotypes which are valued as indicators of genetic malfunction and characteristic of diseases. The phenotype class incorporates a complex dependency between classes, notably anatomical entities and genes.
This paper is organised as follows: Section 2 describes related work in cross domain transfer for biomedical NER, Section 3 discusses our approach including the two data sets used in our experiments, CRF model, feature choices and evaluation framework. In Section 4 we outline our experimental design. Finally in Section 5 we compare the performance of six data selection strategies that try to maximise f-score performance on domain entity classes in the target corpus.

Related work
It is surprising that there exists, to the best of our knowledge, no controlled study that has shed light on the issue of near domain transfer for bioNER in a straightforward manner. The closest approach to our investigation in the biomedical domain is (Wang et al., 2009). Wang et al. explore potential sources of incompatibility across major bioNER corpora with different annotation schema (GENIA: 2000 Medline abstracts; GENETAG: approximately 20,000 Medline sentences; AIMed: 225 Medline abstracts). They focus exclusively on protein name recognition and observe a drop in performance of 12% f-score when combining data from different corpora. Various reasons are put forward, such as differences in entity boundary conventions, the scope of the entity class definitions, distributional properties of the entity classes and the degree of overlap between corpora.
A follow up study by the authors (Wang et al., 2010) looked at increasing compatibility between the GENIA and GENETAG corpora by reorganising the annotation schema to unify protein, DNA and RNA NER under a new label GGP (Gene and Gene Product). However, the best performance from the coarse grained annotations still does not improve on the intra-corpus data.
In earlier work, (Tsai et al., 2006) looked at schema differences between the JNLPBA corpus of 2000 Medline abstracts (Kim et al., 2004) and the BioCreative corpus of 15,000 Medline sentences (Yeh et al., 2005) and tried to harmonise matching criteria. They demonstrated that relaxing the boundary matching criteria was helpful in maximising the cross-domain performance.
In the clinical domain, Wagholikar et al. (2013) explore the effect of harmonising annotation guidelines on the 2010 i2b2 challenge with Mayo Clinic Rochester (MCR) electronic patient records. They concluded that the effectiveness of pooling (i.e. merging of corpora by ensuring a common format and harmonised semantics) is dependent on several factors including compatibility between the annotation schema and differences in size. Again they noticed that simple pooling resulted in a loss of f-score, 12% for MCR and 4% for i2b2. They concluded that the asymmetry was likely due to size effects of the corpora, i.e. MCR, being smaller, suffered a greater loss due to the classifier being biased towards i2b2.
Due to the formulation of these studies and their limited scope it has previously been difficult to understand the precise causal factors affecting performance. Our study sheds light on the expected level of loss under different combination strategies and, more importantly, highlights the non-uniform nature of that loss.

Approach
We assume two small labeled data sets, a source $D_S$ and a target $D_T$, each made up of pairs $(x_i, y_i)$ where $x_i$ represents a covariate or feature vector and $y_i$ is a target or label that can take multiple discrete values. We have a learning algorithm that learns a function $h : X \to Y$ with minimal loss on the portion of $D_T$ used for testing. Any portions of $D_S$ and $D_T$ which are not used in testing can be used to learn $h$. Our task is to explore various strategies for data selection and re-factoring labels/features in order to maximise held out performance.

Data
In this paper we aim to empirically test domain transference for bioNER under the condition that the test and training data are relatively small and drawn from near domains, i.e. from studies on different types of heritable diseases. To do this we selected Medline abstracts from PubMed that were cited by biocuration experts in the canonical database on heritable diseases, the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005). We selected auto-immune diseases and cardio-vascular diseases for our two corpora, which we denote as C1 and C2 respectively. By comparing performance of a single model, a single annotator and a single annotation scheme with a range of sampling techniques we hope to quantify the effects of domain transference in isolation.
The target classes for the entities are as follows:
ANA Anatomical structures in the body, e.g. liver, heart.
GGP Genes and gene products, e.g. KLKB1 gene, highly penetrant recessive major gene.
PHE Phenotype entities describing observable and measurable characteristics of an organism, e.g. cardiovascular abnormalities, abundant ragged-red fibers, elevated IgE levels.
ORG A living organism, e.g. first-degree relatives, mice.
The two corpora were annotated by a single experienced annotator who had participated in the GENIA entity and event corpus annotation. We developed detailed guidelines for single span non-nested entities before conducting a training and feedback session. Feedback was conducted over two weeks by email and direct meetings with the annotator and then annotation took approximately two months. The characteristics of the two corpora are shown in Table 2. Because annotation was carried out by only one person we do not provide inter-annotator scores.
Importantly, we note four points at this stage:
(1) We incorporate a new named entity type, phenotype, which is aligned with investigations into heritable diseases. Semantically it is interesting because phenotypes annotated in the auto-immune literature pertain more often to sub-cellular processes and those in the cardiovascular domain pertain more often to cells, tissues and organs.
(2) It can be seen that two NE classes fall well below 500 instances, which we might arbitrarily consider the necessary level of support for high levels of performance. These are ANA and CHE.
(3) We calculated from Table 2 the average number of mentions for each entity form by class and noted that this is relatively stable across corpora, except for DIS which has less variation in C2 than C1 and CHE which has more variation in C2 than C1. When combining evidence from both corpora the approximate order of type/token ratios is PHE < ANA < CHE, GGP < ORG < DIS, indicating that on average PHE entities have the greatest variation. Average entity lengths in tokens (not shown) indicate that PHE are significantly longer than other entity mentions.
(4) We calculated the probability that a word token in an entity class from one corpus would appear in an instance of the same entity class in the other corpus, reported as columns a and b. Although the probability of an exact match in instances between entities in the two corpora is generally quite low (below 20%, data not shown) there appears to be significant vocabulary overlap in most classes except for chemicals.
Table 2 notes: a, probability that a word in an entity class X in C1 is also a word in entity class X in C2; b, probability that a word in an entity class X in C2 is also a word in entity class X in C1.
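The overlap statistic reported as columns a and b can be illustrated with a minimal Python sketch. The function name `overlap_probability`, the whitespace tokenisation, the lowercasing and the toy mentions are all assumptions made for illustration; the source does not specify how tokens were normalised.

```python
from collections import Counter

def overlap_probability(entities_a, entities_b):
    """Fraction of word tokens in entities_a whose word also occurs in entities_b.

    Both arguments are lists of entity mention strings for ONE entity class.
    Whitespace tokenisation and lowercasing are illustrative assumptions.
    """
    tokens_a = Counter(w for m in entities_a for w in m.lower().split())
    vocab_b = {w for m in entities_b for w in m.lower().split()}
    total = sum(tokens_a.values())
    if total == 0:
        return 0.0
    shared = sum(n for w, n in tokens_a.items() if w in vocab_b)
    return shared / total

# Toy mentions, invented for illustration (not taken from the corpora).
c1_dis = ["rheumatoid arthritis", "type 1 diabetes"]
c2_dis = ["coronary heart disease", "type 2 diabetes", "hypertension"]
a = overlap_probability(c1_dis, c2_dis)  # column a: C1 words also seen in C2
b = overlap_probability(c2_dis, c1_dis)  # column b: C2 words also seen in C1
```

The measure is asymmetric, which is why the two directions are reported separately.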

Conditional Random Fields
As in (Finkel and Manning, 2009) we apply our approach to a linear chain conditional random field (CRF) model (Lafferty et al., 2001; McCallum and Wei, 2003; Settles, 2004; Doan et al., 2012) using the Mallet toolkit with default parameters. CRFs have been shown consistently to be among the highest performing bioNER learners. The data selection strategies employed here, though, are neutral and could have been applied to any other fully supervised learner model.

Features
We made use of a wide range of features, both conventional features such as word or part of speech, as well as gazetteers derived from external classification schemes that have been hand crafted by experts. These are shown in Table 3. Previous studies such as (Ratinov and Roth, 2009)  The feature set is quite large and therefore there is a danger that the learner will be hindered. For feature selection, we conducted baseline test runs under the same experimental conditions as those reported here using a grid search on features F1 to F11 and found that f-score performance was uniformly lower when removing any feature (data not shown but available as supplementary material from the first author).
In order to characterise the contribution each feature is making in label prediction we wanted to provide a measure of similarity between the feature and the class label probability distributions.
Here we use the Gain Ratio (GR) to estimate intra-corpus class prediction performance by each feature. GR was used as a splitting function in C4.5 (Quinlan, 1993) and is defined as
$$GR(C, F) = \frac{IG(C, F)}{H(F)}$$
where $C$ represents a class label and $F$ represents a feature type. IG is information gain and defined as
$$IG(C, F) = H(C) - H(C \mid F)$$
H is entropy and defined for feature types as
$$H(F) = -\sum_{i=1}^{n} p(f_i) \log_2 p(f_i)$$
for $n$ feature types $f_i \in F$. Further information can be found in (Quinlan, 1993). GR is used in C4.5 in preference to IG because of its ability to normalise for the biases in IG. Generally this results in GR having greater predictive accuracy than IG since it takes into account the number of feature values. Note that GR is undefined when the denominator is zero.
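As a worked illustration of the Gain Ratio, the following Python sketch computes entropy, information gain and GR from parallel lists of class labels and feature values. It follows the standard C4.5 definitions; the function names and toy data are assumptions for illustration and this is not the code used to produce Table 3.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(labels, feature):
    """IG(C, F) = H(C) - H(C | F) for parallel lists of labels and feature values."""
    n = len(labels)
    by_value = {}
    for f, c in zip(feature, labels):
        by_value.setdefault(f, []).append(c)
    conditional = sum(len(g) / n * entropy(g) for g in by_value.values())
    return entropy(labels) - conditional

def gain_ratio(labels, feature):
    """GR(C, F) = IG(C, F) / H(F); returns None where GR is undefined."""
    split_info = entropy(feature)
    if split_info == 0:
        return None  # feature takes a single value, denominator is zero
    return information_gain(labels, feature) / split_info

# Toy data, invented for illustration.
labels = ["GGP", "GGP", "O", "O"]
binary_feature = ["cap", "cap", "low", "low"]  # two values, perfectly predictive
unique_feature = ["a", "b", "c", "d"]          # one value per token
gr_binary = gain_ratio(labels, binary_feature)
gr_unique = gain_ratio(labels, unique_feature)
```

Both toy features have the same information gain, but the many-valued feature receives a lower GR because its own entropy is higher, which is the normalisation effect described above.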
Several points emerge from looking at GR and IG values in Table 3:
• C1 (auto-immune) and C2 (cardio-vascular) have about the same information gain contribution from most features but C1 seems to benefit more from GENIA named entity tagging, Human Phenotype Ontology (HPO), Foundation Model of Anatomy (FMA) and Gene Ontology (GO) terms whereas C2 benefits more from the UMLS diseases and ChEBI terms.
• GO, containing terms about genetic processes, has a higher GR in C1 than C2. This supports what we already expected -that auto-immune diseases contain a higher proportion of information about genetic process phenotypes than cardiovascular.
• The GENIA POS tags seem to provide a slightly higher GR in C2 than in C1.
• Despite its large size, UMLS has a smaller GR on both corpora compared to some other resources like HPO or GO or MA. This is despite its high IG value.

Evaluation
Traditional re-sampling using k-fold cross validation (k-CV) divides the n labelled documents into k disjoint subsets of approximately equal size designated as $D_i$ for $i = 1, \ldots, k$. The NER learner is trained successively on k − 1 folds from D and tested on a held out fold over k iterations. In order to preserve independence between contexts in training and held out data we assume here that the unit of division is the document, i.e. a single Medline abstract. Estimated prediction error is calculated based on the learner's labels on the k held out folds. Whilst k-CV is known to be nearly unbiased it is a highly variable estimator. Several studies have looked at k-CV for small sample sets. For example, (Braga-Neto and Dougherty, 2004) found on classifier experiments for small microarray samples (20 ≤ n ≤ 120) that whilst k-CV showed low bias it suffered from excessive variance compared to bootstrap or resubstitution estimators.
One cause of variance has been identified as within-block and between-block training errors arising from the disproportionate effects of a single abstract appearing in the training set of many folds. In order to reduce this effect Monte Carlo cross validation was used (also called CV with repetition). 100 iterations were used to randomly reorder the documents in the corpora before 10-fold CV sampling was run (cv10r100). Sampling of documents is done without replacement so that the independence between training and testing sets is maintained. Stratification was not applied. Micro averaged f-scores for labeling accuracy were calculated based on the 1000 test folds for each model. Evaluation was done in both directions (training and testing) for each corpus C1 and C2 to show any asymmetrical effects. To minimise the time taken for each experiment a cluster computer was used with 48 nodes.
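The cv10r100 sampling regime can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `monte_carlo_cv`, the fixed seed and the round-robin assignment of shuffled documents to folds are illustrative choices, not details given in the source.

```python
import random

def monte_carlo_cv(doc_ids, k=10, repeats=100, seed=0):
    """Yield (train, test) document splits for repeated k-fold cross validation.

    Documents are reshuffled before each repetition and split without
    replacement, so no document appears in both train and test of a split.
    No stratification is applied.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    for _ in range(repeats):
        docs = list(doc_ids)
        rng.shuffle(docs)
        folds = [docs[i::k] for i in range(k)]  # k near-equal disjoint folds
        for i in range(k):
            test = folds[i]
            train = [d for j, fold in enumerate(folds) if j != i for d in fold]
            yield train, test

# 100 repetitions of 10-fold CV over 50 hypothetical document ids.
splits = list(monte_carlo_cv(range(50), k=10, repeats=100))
```

With 100 repetitions of 10 folds the generator produces the 1000 test folds over which micro averaged f-scores are computed.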
The matching criterion we employ is the exact match, i.e. the span of the system labeling and the held out data labels should be exactly the same. Although this is not a necessary criterion for some applications such as database curation we used it here as it is widely applied in shared evaluations and shows the clearest effects of modeling choice.
We evaluate using the named entity precision, recall and F-score calculated using the CoNLL 2003 Perl script. This was calculated as
$$F = \frac{2 \cdot P \cdot R}{P + R}$$
where
$$P = \frac{TP}{TP + FP}$$
and
$$R = \frac{TP}{TP + FN}$$
A true positive (TP) is a gold standard NE tagged by the system as an NE. A true negative (TN) is a gold standard non-NE tagged by the system as a non-NE. A false positive (FP) is a gold standard non-NE tagged by the system as an NE. A false negative (FN) is a gold standard NE not tagged by the system as an NE. Evaluation is based on correctly marked whole entities rather than tokens.
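Exact-match, entity-level scoring can be sketched in Python as below. This is an illustrative re-implementation under the assumption of BIO-style tags, not the CoNLL 2003 Perl script itself, and the helper names `spans` and `prf` are hypothetical.

```python
def spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                out.append((start, i, etype))
            start, etype = (i, tag[2:]) if tag != "O" else (None, None)
    return set(out)

def prf(gold_tags, pred_tags):
    """Precision, recall and F-score over whole entities with exact span match."""
    gold, pred = spans(gold_tags), spans(pred_tags)
    tp = len(gold & pred)   # same boundaries and same class
    fp = len(pred - gold)
    fn = len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy sequence: the system truncates the PHE span, so only ANA is a true positive.
gold = ["B-PHE", "I-PHE", "O", "B-ANA"]
pred = ["B-PHE", "O", "O", "B-ANA"]
p, r, f = prf(gold, pred)
```

Under exact matching the truncated PHE span counts as both a false positive and a false negative. For micro averaging, the TP, FP and FN counts would be summed over all test folds before the ratios are taken.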

Experimental design
In this section we present the experimental conditions we used, starting with a description of the models which we designate M1 to M6 and describe below. All methods made use of 100 iterations of Monte Carlo 10-fold cross validation.

M1: IN DOMAIN
We trained and tested on only the data for the source domain. This method forms our baseline and represents the standard experimental setting.

M2: OUT DOMAIN
We trained on the source domain and tested on the target domain. This method shows expected loss on near domain transference and represents the standard operational setting for users.

M3: MIX-IN
We trained on 100% of the source domain and unified this with 90% of the folded in target domain data, leaving 10% for testing. This method reflects the pooling technique typically employed in corpus construction for bioNER.
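The MIX-IN split can be sketched in Python as follows: every training set holds all source documents plus nine tenths of the target documents, and the remaining tenth of the target is held out. The function name `mix_in_folds` and the document ids are hypothetical, and shuffling between repetitions is omitted for brevity.

```python
def mix_in_folds(source_docs, target_docs, k=10):
    """Yield (train, test) pairs pooling all source docs with k-1 target folds."""
    target = list(target_docs)
    folds = [target[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        target_train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        # 100% of the source domain plus 90% of the target domain
        yield list(source_docs) + target_train, test

# Hypothetical document ids for a source corpus and a target corpus.
source = ["c1_%d" % i for i in range(5)]
target = ["c2_%d" % i for i in range(20)]
mix_in = list(mix_in_folds(source, target))
```

Testing is always on target documents only, so the source corpus never contributes to the held out fold.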

M4: STACK
We trained a CRF model on 100% of the source domain and stacked it with another CRF trained on 90% of the folded in target domain data. Stacking employs a meta-classifier and is a popular method for constructing high performance ensembles of classifiers (Ekbal and Saha, 2013). In this case we collected the output labels from the source domain-trained CRF on target sentences and added them as features for the target domain trained CRF.
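The feature-passing step of stacking can be sketched in Python as below. The sketch covers only how the source-trained model's output labels become features for the target-trained model; it does not call Mallet, and the feature name `src_label`, the dictionary feature representation and the example labels are assumptions for illustration.

```python
def add_stacked_features(sentence_features, source_labels):
    """Append the source-domain model's predicted label to each token's features.

    sentence_features: list of per-token feature dicts for one target sentence.
    source_labels: labels predicted for the same tokens by the source-trained model.
    Returns new dicts; the input feature dicts are left unchanged.
    """
    if len(sentence_features) != len(source_labels):
        raise ValueError("one predicted label is needed per token")
    return [dict(feats, src_label=label)
            for feats, label in zip(sentence_features, source_labels)]

# Hypothetical target sentence and hypothetical source-model predictions.
tokens = [{"word": "malformed"}, {"word": "valve"}]
source_labels = ["B-PHE", "I-PHE"]
stacked = add_stacked_features(tokens, source_labels)
```

The target-domain learner is then trained on the augmented token features, so it can learn when the source model's labels are reliable for a given class.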

M5: BINARY CLASS
We re-labeled the complex class PHE as PHE-C1 in C1 and PHE-C2 in C2 and repeated M3. Afterwards we recombined PHE-C1 and PHE-C2 into PHE.

M6: FRUSTRATINGLY SIMPLE
We followed the feature augmentation approach of (Daumé III, 2007). This method effectively provides a joint learning model on C1 and C2 by splitting each feature into three parts: one for sharing cross domain values and one for each domain specific value. We evaluated using the same regime as M3.
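The feature split can be illustrated with a short Python sketch. Each original feature is copied into a shared version and a version specific to the instance's own domain; the copy for the other domain is simply absent, which is equivalent to a zero value. The prefix scheme and the function name `augment` are assumptions for illustration.

```python
def augment(features, domain):
    """Return general and domain-specific copies of every feature.

    features: dict of feature name -> value for one token.
    domain: identifier of the corpus the token comes from, e.g. "C1" or "C2".
    """
    out = {}
    for name, value in features.items():
        out["general:" + name] = value   # shared across both domains
        out[domain + ":" + name] = value  # active only for this domain
    return out

# A token from the cardiovascular corpus (toy feature values).
augmented = augment({"word": "valve", "pos": "NN"}, "C2")
```

A learner trained on the augmented representation can place weight on the general copy when a feature behaves consistently across C1 and C2, and on the domain-specific copy when it does not.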

Experimental results and discussion
In Table 4 we show f-score performance from near biomedical domains with our six strategies. This section now tries to draw together an interpretation for the performance trends that we see and to drill down to some of the causal factors.
Held out tests performed in-domain (M1) on both corpora C1 and C2 indicate a relatively high level of performance, conservatively in line with state-of-the-art estimates. The broad trend in performance is for entity classes with more instances to outperform others with lower numbers. The class which most obviously breaks this trend is the complex entity type of PHE. To understand this consider that PHE is defined as an observable property on an organism and as such tends to be formed from a quality such as malformed that describes a structural entity such as valve. To look more closely at what is happening we examined the confusion matrices for M1 on both corpora. For both C1 and C2 we observed that a substantial proportion of words inside PHE sequences were confused with GGP, DIS or ANA entities. Similarly a high proportion of words inside ANA sequences were confused with PHE entities. This indicates that dependencies within complex biomedical entities like PHE might better be modeled explicitly using tree-structures in a manner similar to events rather than using n-gram relations.
In the M2 out of domain experiments we see a generally severe loss of f-score performance across most classes. Training on C2 and testing on C1 results in a 19.1% loss (F1 69.9 to 50.8) and training on C1 and testing on C2 results in an 11.9% loss overall (F1 58.5 to 46.6). The results agree with Wang et al.'s experience on heterogeneous Medline corpora and extend the upper limit on all-class loss due to domain transference to 19%. The only NE class where we see a symmetric benefit from pooling entities in M3 is for ORG (F1 68.4 to 72.2, F1 73.2 to 77.4). Intriguingly the data from Tables 2 and 4 hint at a correlation between the success of M3 pooling for ORG and broad cross-domain compatibility on the vocabulary (over 50% of ORG vocabulary is shared across corpora). However this is not supported in the low sharing case for CHE where we see increased performance from pooling (F1 31.3 to 38.7) when the target is C2 but decreased performance when the target is C1 (F1 29.5 to 20.0).
When we look at the pooling method (M3) and compare to the in-domain method (M1) no obvious size effect occurs for the number of entities in each class. To see this we can examine entity classes with an imbalanced number of instances in C1 and C2 such as CHE, GGP and PHE. Consider the following three cases: (1) Adding 147 instances of CHE from C2 to 44 instances from C1 is associated with CHE performance dropping from M1:29.5 to M3:20.0 when tested on C1; (2) Similarly adding 1430 instances of PHE from C2 to 507 instances from C1 is associated with PHE performance dropping from 46.0 in M1 to 39.7 in M3 when tested on C1; (3) But adding 1663 instances of GGP from C1 to 754 from C2 is associated with GGP rising from 57.2 in M1 to 61.1 in M3. If simply pooling more entities was important to improved f-score we would expect to see a clearer pattern of improvement but we do not.
The overall pooling loss for all classes on M3 is within 3% in both directions and within the bounds observed by (Wang et al., 2009) and (Wagholikar et al., 2013) for their pooling of heterogeneous Medline corpora. Except for the ORG class which we highlighted above, we might cautiously quantify the loss of pooled entity mentions as being in the range up to 9.5% for CHE but more typically below 4%. The majority of the differences they observed, which are not present in our data, are most likely due to concept definition differences and annotation conventions. In contrast to our expectations the M4 experiments showed very mild benefits for stacking and these were mixed across entity types. M4 tests on C2 showed no general improvement but some improvement in CHE and ORG. M4 tests on C1 resulted again in no overall improvement except for some gain for ORG, supporting our hypothesis that there is greater compatibility in ORG across domains.
The M5 approach of splitting the PHE labels for the two corpora resulted in a noticeable improvement over M3 on the C1 test but unfortunately this was not sustained when testing on C2.
It is striking that in the M6 experiments the feature augmentation method only just meets the in-domain f-score on C1 and mildly exceeds it on C2. One explanation is that the corpora are so small that a richer feature set has only marginal effects on performance. Table 3 certainly indicates that many of the features have low predictive capacity (gain ratio values below 0.1) in an intra-corpus setting but this is not the case for others such as GENIA NE tags or HPO gazetteer terms.
Overall when we average the f-scores across models for C1 and C2 we see that there is a marginal benefit to the M1, M4 and M6 strategies over M3 and M5 with M2 suffering the greatest loss in performance.

Conclusion
In this paper we have provided evidence that transference even to closely related domains in biomedical NER incurs a severe loss in f-score. We have demonstrated empirically that strategies that make use of multi-domain corpora such as stacking learners and feature augmentation mitigate the accuracy loss but do not necessarily result in superior performance except for selected classes such as organisms where there appears to be broad terminology consensus. Simple pooling of data across domains failed to exploit size effects especially for the complex class of phenotypes. The list of strategies employed has not been exhaustive and it is possible that others such as feature hierarchies (Arnold et al., 2008) might yield better results.
BioNER is complicated by various factors such as descriptive names, polysemous terms, conjunctions, nested constructions and a high quantity of abbreviations. We have shown that performance is also held back by not considering document level properties related to domain such as topicality. We can expect lower performance and higher annotation costs if we do not adequately allow for the distributional dissimilarities of domains during learning, even in closely related topical settings.