Ontology-based Extraction of Structured Information from Publications on Preclinical Experiments for Spinal Cord Injury Treatments

Preclinical research in the field of central nervous system trauma advances at a fast pace, currently yielding over 8,000 new publications per year, at an exponentially growing rate. This amount of published information by far exceeds the capacity of individual scientists to read and understand the relevant literature. So far, no clinical trial has led to therapeutic approaches which achieve functional recovery in human patients. In this paper, we describe a first prototype of an ontology-based information extraction system that automatically extracts relevant preclinical knowledge about spinal cord injury treatments from natural language text by recognizing participating entity classes and linking them to each other. The evaluation on an independent test corpus of manually annotated full text articles shows a macro-average F1 measure of 0.74 with precision 0.68 and recall 0.81 on the task of identifying entities participating in relations.


Introduction
Injury to the central nervous system of adult mammals typically results in lasting deficits, such as permanent motor and sensory impairments, due to a lack of profound neural regeneration. Specifically, patients who have sustained spinal cord injuries (SCI) usually remain partially paralyzed for the rest of their lives. Preclinical research in the field of central nervous system trauma advances at a fast pace, currently yielding over 8,000 new publications per year, at an exponentially growing rate, with a total of approximately 160,000 PubMed-listed papers today (see footnote 2). However, translational neuroscience faces a strong disproportion between the immense preclinical research effort and the lack of successful clinical trials in SCI therapy: So far, no therapeutic approach has led to functional recovery in human patients (Filli and Schwab, 2012). As the vast amount of published information by far exceeds the capacity of individual scientists to read and understand the relevant knowledge (Lok, 2010), the selection of promising therapeutic interventions for clinical trials is notoriously based on incomplete information (Prinz et al., 2011; Steward et al., 2012).
Thus, automatic information extraction methods are needed to gather structured, actionable knowledge from large amounts of unstructured text that describe outcomes of preclinical experiments in the SCI domain. Stored in a database, such knowledge provides a highly valuable resource that enables curators and researchers to objectively assess the prospective success of experimental therapies in humans, and supports the cost-effective execution of meta-studies based on all previously published data. First steps towards such a database have already been undertaken by manually extracting the desired information from a limited number of papers (Brazda et al., 2013); this manual approach, however, is not feasible on a large scale.
In this paper, we present a first prototype of an automated ontology-based information extraction system for the acquisition of structured knowledge about experimental SCI therapies. As main contributions, we point out the highly relational problem structure by describing the entity classes and relations relevant for knowledge representation in the domain, and provide a cascaded workflow that is capable of extracting these relational structures from unstructured text with an average F1 measure of 0.74.

Figure 1: Workflow of our implementation, from the input PDF document to the generation of the output relations. Named entity recognition is described in Section 3.1, relation extraction in Section 3.2.

Footnote 1: The first four authors contributed equally.
Footnote 2: As in this query to the database PubMed (link to http://www.ncbi.nlm.nih.gov/pubmed), as of April 2014.

Related Work
Our workflow for acquiring structured information in the domain of spinal cord injury treatments is an example of ontology-based information extraction systems (Wimalasuriya and Dou, 2010): Large amounts of unstructured natural language text are processed through a mechanism guided by an ontology, in order to extract predefined types of information. Our long-term goal is to represent all relevant information on SCI treatments in structured form, similar to other automatically populated databases in the biomedical domain, such as STRING-DB for protein-protein interactions (Franceschini et al., 2013), among others.
A strong focus in biomedical information extraction has long been on named entity recognition. For this task, machine-learning solutions such as conditional random fields (Lafferty et al., 2001) and dictionary-based systems (Schuemie et al., 2007; Hanisch et al., 2005; Hakenberg et al., 2011) are available; they tackle the respective problem with decent performance and target specific entity classes such as organisms (Pafilis et al., 2013) or symptoms (Savova et al., 2010; Jimeno et al., 2008). A detailed overview of named entity recognition, covering other domains as well, can be found in Nadeau and Sekine (2007).
The use case described in this paper, however, involves a highly relational problem structure in the sense that individual facts or relations have to be aggregated in order to yield accurate, holistic domain knowledge. This corresponds most closely to the problem structure encountered in event extraction, as triggered by the ACE program (Doddington et al., 2004; Ji and Grishman, 2008; Strassel et al., 2008) and the BioNLP shared task series (Nedellec et al., 2013; Tsujii et al., 2011; Tsujii, 2009). General semantic search engines in the biomedical domain mainly focus on isolated entities. Relations are typically only taken into account by co-occurrence at the abstract or sentence level. Examples of such search engines include GoPubMed (Doms and Schroeder, 2005), SCAIView (Hofmann-Apitius et al., 2008), and GeneView (Thomas et al., 2012).
With respect to the extraction methodology, our work is similar to Saggion et al. (2007) and Buitelaar et al. (2008), in that a combination of gazetteers and extraction rules is derived from the underlying ontology, in order to adapt the workflow to the domain of interest. A schema in terms of a reporting standard has recently been proposed by the MIASCI-consortium (Lemmon et al., 2014, Minimum Information About a Spinal Cord Injury Experiment). To the best of our knowledge, our work is the first attempt at automated information extraction in the SCI domain.

Method and Architecture
An illustration of the proposed workflow is shown in Figure 1. Full text PDF documents serve as input to the workflow, which is based on the Unstructured Information Management Architecture (UIMA; Ferrucci and Lally, 2004). Plain text and structural information are extracted from these documents using Apache PDFBox.
The proposed system extracts relations, which we define as templates that contain slots, each of which is to be filled by an instance of a particular entity class (cf. Table 1). At the same time, a particular instance can be a filler for different slots (cf. Figure 2). We argue that a relational approach is essential to information extraction in the SCI domain as (i) many instances of entity classes found in the text do not convey relevant information on their own, but only in combination with other instances (e.g., surgical devices mentioned in the text are only relevant if used to inflict a spinal cord injury on the animals in an experimental group), and (ii) a holistic picture of a preclinical experiment can only be captured by aggregating several relations (e.g., a certain p value being mentioned in the text implies that a particular treatment of one group of animals differs significantly from another treatment of a control group). We take four relations (Animal, Injury, Treatment and Result) into account, which capture the semantic essence of a preclinical experiment: Laboratory animals are injured, then treated, and the effect of the treatment is measured. Table 1 provides an overview of all entity classes and relations. The workflow consists of two steps: First, rule- and ontology-based named entity recognition (NER) is performed (cf. Section 3.1). Second, the pool of entities recognized during NER serves as a basis for relation extraction (cf. Section 3.2).

Table 1: A detailed list of relations and the entity classes whose instances are valid slot fillers for them. Examples for instances of each entity class are also shown, as well as the extraction method and the resources used for extraction. Instances are either extracted from the text using regular expressions (R) or based on a lookup in our ontology database (O). Resources in italics were specifically created for this application, resources in SMALL CAPITALS are regular expression-based recombinations of other entities. Entity classes in bold face are required arguments for relation extraction (cf. Section 3.2). The count specifies the number of elements in the respective resource.

Ontology-based Named Entity Recognition
We store ontological information in a relational database as a set of directed graphs, accompanied by a dictionary for efficient token lookup. Each entity is stored with possible linguistic surface forms (e.g., "Wistar rats" as a surface form of the Wistar rat entity from the class Laboratory Animal). Each surface form s is tokenized (on white space and non-alphanumeric symbols, including transformation to lowercase, e.g., leading to tokens "wistar" and "rats") and normalized (stemming, removal of special characters and stop words), resulting in a set of dictionary keys (e.g., "wistar" and "rat"). The resources used as content for the ontology are shown in Table 1. We use specifically crafted resources for our use case as well as the NCBI taxonomy and the Medical Subject Headings (MeSH). The process of ontology-based NER consists of (i) token lookup in the dictionary, (ii) candidate generation, (iii) probabilistic candidate filtering and (iv) ontological reduction (cf. Figure 1).

Figure 2: Two example instances of the Animal relation that can be generated from the same text ("Five adult male guinea pigs weighing 200-250 g."). Given its entity class, the number 200 is a valid filler for the 'number' slot as well as the 'weight' slot. Both candidates are generated and ranked according to their probability (cf. Equation 4). The manually defined constraints of p_sem ensure that 200 cannot fill both slots at the same time.

Token lookup. For each token t in the document, the corresponding surface form tokens s_t are retrieved from the database. A confidence value p_conf based on the Damerau-Levenshtein distance without swaps (dld; Damerau, 1964) is calculated as [...], where |t| denotes the number of characters in token t. Assuming we find t = "rat" in the text with the corresponding surface form s_t = ("wistar", "rats"), we obtain p_conf(t, s_t) = 1 − 1/4 = 0.75. Tokens with p_conf < 0.5 are discarded.

Candidate generation. A candidate h for matching the surface form tokens s_h is a list of tokens (t^h_1, ..., t^h_n) from the text. Candidates are constructed using all possible combinations of matching tokens for each surface form token (as retrieved above). To keep this tractable, we restrict the search space to combinations with the proximity [...], where d(u, v) models the distance between two tokens u and v in the text, with N_W, N_S, N_P denoting the number of words, sentences and paragraphs between u and v. In our example, a candidate would be h = ("rat").

Candidate filtering. For a candidate h and the surface form tokens s_h it refers to, we calculate a total match probability p_match(h, s_h) = [...], taking into account the distance d(u, v) of all tokens in the candidate, the confidence p_conf(t', s_h) that the token t' actually belongs to the surface form, and the ratio Σ_{t'∈h} |t'| / Σ_{t∈s_h} |t| of the surface form tokens covered by the candidate. Here, p_dist(u, v) = [...] models the confidence that two tokens u and v belong together given their distance in the text. In our example, the candidate h = ("rat") with the surface form tokens s_h = ("wistar", "rats") yields p_match(h, s_h) = 1 · 0.75 · 3/(6+4) = 0.225. Candidates with p_match < 0.7 are discarded.
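The worked example above can be retraced in a few lines of Python. This is a minimal sketch and not the authors' implementation: all function names are hypothetical, the confidence is assumed to be normalized by the length of the longer of the two tokens (an assumption that reproduces 1 − 1/4 = 0.75 for "rat" vs. "rats"), the per-token confidences are assumed to be multiplied, and the distance term p_dist is simply passed in as a number.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions, no swaps."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_confidence(token: str, surface_tokens: Sequence[str]) -> float:
    """Best confidence of `token` against any surface form token.

    ASSUMPTION: normalization by the longer token length.
    """
    return max(
        1.0 - edit_distance(token, s) / max(len(token), len(s))
        for s in surface_tokens
    )


def match_probability(candidate: Sequence[str],
                      surface_tokens: Sequence[str],
                      p_dist: float = 1.0) -> float:
    """Distance term times token confidences times character coverage.

    ASSUMPTION: the three factors are combined by multiplication, as the
    worked example 1 * 0.75 * 3/(6+4) suggests.
    """
    confidence = 1.0
    for t in candidate:
        confidence *= token_confidence(t, surface_tokens)
    coverage = sum(len(t) for t in candidate) / sum(len(s) for s in surface_tokens)
    return p_dist * confidence * coverage


surface = ("wistar", "rats")
print(token_confidence("rat", surface))                # 0.75
print(round(match_probability(["rat"], surface), 3))   # 0.225
```

With the thresholds stated in the text, the token "rat" would pass the lookup filter (0.75 ≥ 0.5), while the single-token candidate would be discarded during candidate filtering (0.225 < 0.7).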
The resulting set of all recognized candidates is denoted by H.

Ontological reduction. As the algorithm ignores the hierarchical information provided by the ontologies, we may obtain overlapping matches for ontologically related entities. Therefore, in case of overlapping entities that are related by an "is a" relationship in the ontology, only the more specific one is kept. Assume for instance the candidates "Rattus norvegicus" and "Rattus norvegicus albus", where the latter is more specific and therefore accepted.
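The reduction step can be illustrated with a small sketch. It assumes, purely for illustration, that each match carries character offsets and that the "is a" hierarchy is available as a child-to-parent dictionary with a single parent per entity; the names `Match`, `is_ancestor` and `reduce_matches` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Match(NamedTuple):
    start: int   # character offset where the match begins
    end: int     # character offset where the match ends (exclusive)
    entity: str  # name of the matched ontology entity


def is_ancestor(parents: Dict[str, str], ancestor: str, entity: str) -> bool:
    """True if `ancestor` is reachable from `entity` via "is a" links."""
    seen = set()
    current = parents.get(entity)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guards against accidental cycles
        current = parents.get(current)
    return False


def overlaps(a: Match, b: Match) -> bool:
    return a.start < b.end and b.start < a.end


def reduce_matches(matches: List[Match], parents: Dict[str, str]) -> List[Match]:
    """Drop a match if an overlapping match refers to a more specific entity."""
    return [
        m for m in matches
        if not any(
            overlaps(m, other) and is_ancestor(parents, m.entity, other.entity)
            for other in matches
        )
    ]


# Hypothetical ontology fragment and matches for the example in the text.
parents = {"Rattus norvegicus albus": "Rattus norvegicus"}
matches = [
    Match(0, 17, "Rattus norvegicus"),
    Match(0, 23, "Rattus norvegicus albus"),
    Match(40, 57, "Rattus norvegicus"),  # no overlap with a more specific match
]
print(reduce_matches(matches, parents))
```

In this sketch the general match at offset 0 is dropped in favour of the overlapping, more specific one, while the non-overlapping general match at offset 40 is kept.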

Relation Extraction
We frame relation extraction as a template filling task such that each slot provided by a relation has to be assigned a filler of the correct entity class. Entity classes for the four relations of interest are shown in Table 1, where required slots are in bold face, whereas all other slots are optional.
The slot filling process is based on testing all combinations of appropriate entities, taking into account their proximity and additional constraints. In more detail, we define the set of all recognized relations R_θ of a type θ as [...], where P(H) denotes the power set over all candidates H recognized by NER, g(r_θ) returns the filler for the required slot of r_θ, p_match and p_dist are defined as in Section 3.1, and p_sem implements manually defined constraints on r_θ: A wrongly typed filler h for one slot of r_θ leads to p_sem(r_θ) = 0, as does a negative number in the Number slot of the Animal relation. Animal Numbers larger than 100 or Animal Weights smaller than 1 g or larger than 1 t are penalized. All other cases lead to p_sem(r_θ) = 1. Note that p_match(h, s_h) = 1 for candidates h retrieved by rule-based entity recognition. Further, we set σ_Animal = σ_Treatment = 6, σ_Injury = 10 and σ_Result = 15.

The workflow is evaluated against an independent, manually annotated corpus of 32 complete papers which contain 1186 separate annotations of entities, produced by domain experts. Information about relations is not provided in the corpus. Only entities which participate in the description of the preclinical experiment are marked. The frequencies of annotations among the different classes are shown in Table 2.
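Returning to the semantic constraints p_sem used during relation extraction, the following minimal sketch shows how such checks could look for the Animal relation. Several details are assumptions made only for illustration: the slot names and allowed entity classes in `ALLOWED`, the representation of an instance by its token position, and the penalty factor of 0.5 (the text only states that implausible values are penalized, not by how much).

```python
from dataclasses import dataclass
from typing import Dict, Optional

PENALTY = 0.5  # ASSUMPTION: the actual penalty value is not given in the text

# ASSUMPTION: hypothetical slot names and the entity classes they accept.
ALLOWED = {
    "organism": {"Organism"},
    "number": {"Number"},
    "weight": {"Number"},
}


@dataclass(frozen=True)
class Filler:
    entity_class: str
    position: int                  # token position, identifies the instance
    value: Optional[float] = None  # numeric value; weights are in grams


def p_sem_animal(slots: Dict[str, Filler]) -> float:
    # A wrongly typed filler invalidates the relation.
    for slot, filler in slots.items():
        if filler.entity_class not in ALLOWED.get(slot, set()):
            return 0.0
    # The same instance must not fill two slots at once (cf. Figure 2).
    positions = [f.position for f in slots.values()]
    if len(positions) != len(set(positions)):
        return 0.0
    score = 1.0
    number = slots.get("number")
    if number is not None and number.value is not None:
        if number.value < 0:
            return 0.0            # negative animal counts are impossible
        if number.value > 100:
            score *= PENALTY      # unusually large group
    weight = slots.get("weight")
    if weight is not None and weight.value is not None:
        if weight.value < 1 or weight.value > 1_000_000:  # below 1 g or above 1 t
            score *= PENALTY
    return score


# "Five adult male guinea pigs weighing 200-250 g."
organism = Filler("Organism", position=3)
five = Filler("Number", position=0, value=5)
two_hundred = Filler("Number", position=6, value=200)

print(p_sem_animal({"organism": organism, "number": five, "weight": two_hundred}))
print(p_sem_animal({"organism": organism, "number": two_hundred}))
print(p_sem_animal({"number": two_hundred, "weight": two_hundred}))
```

Under these assumptions, the reading with five animals weighing 200 g scores 1.0, the reading with 200 animals is penalized, and using the same instance of 200 for both slots is rejected.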

Experimental Settings
We evaluate the system with regard to two different tasks: extraction ("Is the approach able to extract relevant information from the text, without regard to the exact location of the information?") and annotation ("Is the system able to annotate relevant information at the correct location as indicated by medical experts?"). Furthermore, we distinguish between an all instances setting, where we consider all instances independently, and a fillers only setting, where only those annotations in the system output are considered that are fillers in a relation (i.e., the fillers only setting evaluates a subset of the all instances setting). The relation extraction procedure is not evaluated separately. For each setting, we report precision, recall, and F1 measure. Taking the architecture into account, we have the following hypotheses: (i) For the all instances setting, we expect high recall, but low precision. (ii) For the fillers only setting, precision should increase notably. (iii) Comparing the all instances and the fillers only setting, recall should remain at the same level. Moreover, we expect the extraction task to be simpler than the annotation task: For any information to be annotated at the correct position, it must have been extracted correctly. On the other hand, information that has been extracted correctly can still be found at a 'wrong' location in the text. Thus, we expect a drop in precision and recall when moving from extraction to annotation.
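The difference between the two tasks can be made concrete with a small scoring sketch. It is an illustration under stated assumptions rather than the evaluation script used here: items are assumed to be (entity class, normalized entity, character offset) triples, the extraction task compares entities per class irrespective of offset, the annotation task additionally requires identical offsets, and the toy data is invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Item = Tuple[str, str, int]  # (entity class, normalized entity, character offset)


def prf(predicted: Set, gold: Set) -> Tuple[float, float, float]:
    """Precision, recall and F1 for two sets of comparable keys."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def evaluate(predicted: Iterable[Item], gold: Iterable[Item], with_position: bool):
    """Per-class scores and their macro-average.

    with_position=False mimics the extraction task,
    with_position=True mimics the annotation task.
    """
    def group(items: Iterable[Item]) -> Dict[str, Set]:
        grouped = defaultdict(set)
        for entity_class, entity, offset in items:
            key = (entity, offset) if with_position else entity
            grouped[entity_class].add(key)
        return grouped

    pred_by_class, gold_by_class = group(predicted), group(gold)
    classes = sorted(set(pred_by_class) | set(gold_by_class))
    per_class = {c: prf(pred_by_class[c], gold_by_class[c]) for c in classes}
    n = len(classes) or 1
    macro = tuple(sum(s[i] for s in per_class.values()) / n for i in range(3))
    return per_class, macro


# Invented toy data: "rat" is found, but not where the expert marked it.
gold = [("Organism", "rat", 120), ("Dosage", "5 mg/kg", 300)]
predicted = [("Organism", "rat", 15), ("Organism", "mouse", 40),
             ("Dosage", "5 mg/kg", 300)]

print(evaluate(predicted, gold, with_position=False)[1])  # extraction task
print(evaluate(predicted, gold, with_position=True)[1])   # annotation task
```

In this toy case the macro-averaged scores drop when positions must match, because the correctly extracted "rat" counts as an error at the 'wrong' location. This is the effect the last hypothesis anticipates.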

Results
The results are presented in Table 3: For each relation mentioned in Section 3, and the entity classes participating in it, we report precision, recall and F1 measure. This is done for all four combinations of setting and task. For each relation we also provide the macro-average of precision, recall and F1 measure over all entity classes considered in that relation and the overall average.

Table 3: The macro-averaged evaluation results for each class given in precision, recall and F1 measure.
For the extraction task in the all instances setting, recall is close to 100% for all entity classes considered in the Animal relation. It is 81% for Dosages. The rule-based recognition for Dosages (as for Ages and p Values) is very precise: All recognized entities have been annotated by medical experts somewhere in the document. This strong difference between entity classes can be observed in the annotation task and the fillers only setting as well: The best average performance in F1 measure is achieved for entity classes that are part of the Animal relation. Precision is best for Dosages, Ages and p Values.
The recall for the all instances setting is high in both the extraction and the annotation task. However, the number of instances annotated by the system (29,628 annotations in total) is about 25 times higher than the number of expert annotations, which leads to low precision, especially in the annotation task. For the fillers only setting, the number of annotations decreases dramatically (to 4,069 annotations); at the same time, precision improves. Regarding the comparison of both tasks, precision and recall are both notably lower in the annotation task, for the all instances setting as well as for the fillers only setting. The overall recall is lower by 14 percentage points (pp) in the extraction task and by 26 pp in the annotation task when considering the fillers only setting. The decrease is most pronounced for Investigation Methods in the annotation task, with a drop of 50 pp.
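The ratios quoted above follow directly from the reported counts (1186 expert annotations, 29,628 system annotations in the all instances setting and 4,069 in the fillers only setting). The snippet below is only a sanity check with hypothetical variable names.

```python
expert_annotations = 1186
all_instances = 29628
fillers_only = 4069

ratio_all = all_instances / expert_annotations      # about 25
ratio_fillers = fillers_only / expert_annotations   # about 3.4
reduction = all_instances / fillers_only            # about 7.3

print(f"{ratio_all:.1f} {ratio_fillers:.1f} {reduction:.1f}")
```

Even in the fillers only setting, the system thus still produces more than three times as many annotations as the experts did.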

Discussion
The results are promising for named entity recognition. Recall is close to perfect in the extraction task and acceptable in the annotation task. The results for relation extraction leave room for improvement: An increase in precision can be observed, but the decrease in recall is too substantial. The Animal relation is an exception: here, an increase in F1 measure is observed in the fillers only setting for nearly all entity classes, leading to an F1 measure of 0.87 for Animals in the extraction task.
An error analysis revealed that for the fillers only setting, most false positives (55%) are due to the fact that the medical experts did not annotate all occurrences of the correct entity, but only one or a few. A further 18% are due to ambiguities of surface forms (for instance, the abbreviation "it" for "intrathecal" leads to many false positives). Regarding false negatives, 41% are due to missing entries in our ontology database and a further 26% are caused by incorrect handling of characters (mostly wrong transcriptions of characters from the PDF).

Conclusion and Outlook
We described the challenge of extracting relational descriptions about preclinical experiments on spinal cord injury from scientific literature. To tackle that challenge, we introduced a cascaded approach of named entity recognition, followed by relation extraction. Our results show that the first step can be achieved by relying strongly on domain-specific ontologies. We show that modeling relations as aggregated entities, and extracting them using a distance filtering principle combined with domain specific knowledge, yields promising results, specifically for the Animal relation.
Future work will focus on improving the recognition at the correct position in the text. This is a prerequisite to actually tackle and evaluate the relation extraction not only on the basis of detected participating entities. Therefore, improved relation detection approaches will be implemented which relax the assumption that relevant entities are found close-by in the text. In addition, we will relax the assumption that different slots of the annotation are all equally important. Finally, we will address aggregation beyond individual relations in order to allow for a fully accurate holistic assessment of experimental therapies.
Our system offers a semantic analysis of scientific papers on spinal cord injuries. This lays the groundwork for populating a comprehensive semantic database on preclinical studies of SCI treatment approaches as described by Brazda et al. (2013), thereby supporting the transfer from preclinical to clinical knowledge in the future.