Building a semantically annotated corpus for congestive heart and renal failure from clinical records and the literature

Narrative information in Electronic Health Records (EHRs) and literature articles contains a wealth of clinical information about treatment, diagnosis, medication and family history. This often includes detailed phenotype information for specific diseases, which in turn can help to identify risk factors and thus determine the susceptibility of different patients. Such information can help to improve healthcare applications, including Clinical Decision Support Systems (CDS). Clinical text mining (TM) tools can provide efficient automated means to extract and integrate vital information hidden within the vast volumes of available text. Development or adaptation of TM tools is reliant on the availability of annotated training corpora, although few such corpora exist for the clinical domain. In response, we have created a new annotated corpus (PhenoCHF), focussing on the identification of phenotype information for a specific clinical sub-domain, i.e., congestive heart failure (CHF). The corpus is unique in this domain, in its integration of information from both EHRs (300 discharge summaries) and literature articles (5 full-text papers). The annotation scheme, whose design was guided by a domain expert, includes both entities and relations pertinent to CHF. Two further domain experts performed the annotation, resulting in high quality annotation, with agreement rates up to 0.92 F-Score.


Introduction
An ever-increasing number of scientific articles is published every year. For example, in 2012, more than 500,000 articles were published in MEDLINE (U.S. National Library of Medicine, 2013). A researcher would thus need to review at least 20 articles per day in order to keep up to date with the latest knowledge and evidence in the literature (Perez-Rey et al., 2012).
EHRs constitute a further rich source of information about patients' health, representing different aspects of care (Jensen et al., 2012). However, clinicians at the point of care have very limited time to review the potentially large amount of data contained within EHRs. This presents significant barriers to clinical practitioners and computational applications (Patrick et al., 2006).
TM tools can be used to extract phenotype information from EHRs and the literature and help researchers to identify the characteristics of CHF and to better understand the role of the deterioration in kidney function in the cycle of progression of CHF.

Related work
There are many well-known publicly available corpora of scientific biomedical literature, which are annotated for biological entities and/or their interactions (often referred to as events) (Roberts et al., 2009; Xia & Yetisgen-Yildiz, 2012). Examples include GENIA (Kim et al., 2008), BioInfer (Pyysalo et al., 2007), GREC (Thompson et al., 2009), PennBioIE (Kulick et al., 2004), GENETAG (Tanabe et al., 2005) and LLL'05 (Hakenberg et al., 2005). However, none of these corpora is annotated with the types of entities and relationships that are relevant to the study of phenotype information.
On the other hand, corpora of clinical text drawn from EHRs are rare, due to privacy and confidentiality concerns, but also because of the time-consuming, expensive and tedious nature of producing high quality annotations, which are reliant on the expertise of domain experts (Uzuner et al., 2011). A small number of corpora, however, have been made available, mainly in the context of shared task challenges, which aim to encourage the development of information extraction (IE) systems. These corpora vary in terms of the text type and annotation granularity. For example, the corpus presented by Pestian et al. (2007) concerns only structured data from radiology reports, while the corpus presented by Meystre & Haug (2006) contains unstructured parts of EHRs, but is annotated with medical problems only at the document level.
Other corpora are more similar to ours, in that they include text-bound annotations corresponding to entities or relations. CLEF (Clinical E-Science Framework) (Roberts et al., 2008) was one of the first such corpora to include detailed semantic annotation. It consists of a number of different types of clinical records, including clinic letters, radiology and histopathology reports, which are annotated with a variety of clinical entities, relations between them and co-reference. However, the corpus has not been made publicly available. The more recent 2013 CLEF-eHEALTH challenge (Suominen et al., 2013) corpus consists of EHRs annotated with named entities referring to disorders and acronyms/abbreviations, mapped to UMLS concept identifiers.
The Informatics for Integrating Biology at the Bedside (i2b2) NLP series of challenges has released a corpus of de-identified clinical records annotated to support a number of IE challenges with multiple levels of annotation, i.e., entities and relations (Uzuner et al., 2008; Uzuner, 2009). The 2010 challenge included the release of a corpus of discharge summaries and patient reports in which named entities and relations concerning medical problems, tests and treatments were annotated (Uzuner et al., 2011). A corpus of EHRs from Mayo Clinic has been annotated with both linguistic information (part-of-speech tags and shallow parsing results) and named entities corresponding to disorders (Ogren et al., 2008; Savova et al., 2010).

Description of the corpus
The discharge summaries in our PhenoCHF corpus constitute a subset of the data released for the second i2b2 shared task, known as "recognising obesity" (Uzuner, 2009). The PhenoCHF corpus was created by filtering the original i2b2 corpus, such that only those summaries (a total of 300) for patients with CHF and kidney failure were retained.
The second part of PhenoCHF consists of the 5 most recent full text articles (at the time of query submission) concerning the characteristics of CHF and renal failure, retrieved from the PubMed Central Open Access database.

Methods and results
The design of the annotation schema was guided by an analysis of the relevant discharge summaries, in conjunction with a review of comparable domain specific schemata and guidelines, i.e., those from the CLEF and i2b2 shared tasks. The schema is based on a set of requirements developed by a cardiologist. Taking into account our chosen focus of annotating phenotype information relating to CHF, the cardiologist was asked firstly to determine a set of relevant entity types that relate to CHF phenotype information and the role of the decline in kidney function in the cycle of CHF (exemplified in Table 1), secondly to locate words that modify the entity (such as polarity clues) and thirdly to identify the types of relationships that exist between these entity types in the description of phenotype information (Table 2).
In addition, medical terms in the records were mapped semi-automatically onto clinical concepts in UMLS, with the aid of MetaMap (Aronson, 2001).
The same annotation schema and guidelines were used for both the discharge summaries and the scientific full articles. In the latter, certain annotations were omitted, i.e., organ entities, polarity clues and relations. This decision was taken due to the differing ways in which phenotype information is expressed in discharge summaries and scientific articles. In discharge summaries, phenotype information is explicitly described in the patient's medical history, diagnoses and test results. On the other hand, scientific articles summarise results and research findings. This means that certain types of information that occur frequently in discharge summaries are extremely rare in scientific articles, such that their occurrences are too sparse to be useful in training TM systems, and hence they were not annotated.
The annotation was carried out by two medical doctors, using the Brat Rapid Annotation Tool (brat) (Stenetorp et al., 2012), a highly configurable and flexible web-based tool for textual annotation.
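As an illustration of how annotations produced with brat can be processed downstream, the following minimal sketch reads brat's standoff (.ann) format, in which text-bound annotations are stored as tab-separated lines with IDs beginning with "T" and binary relations with IDs beginning with "R". The labels Organ and Manifestation and the sample content are hypothetical placeholders chosen for illustration, not the actual PhenoCHF configuration. Discontinuous spans are simplified to their outermost offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass(frozen=True)
class Relation:
    id: str
    type: str
    arg1: str
    arg2: str


def parse_ann(content):
    """Parse the text of a brat .ann file into entities and relations.

    Only text-bound ("T") and relation ("R") lines are handled; other
    annotation kinds (attributes, notes, events) are skipped.
    """
    entities, relations = [], []
    for line in content.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # "Type start end", or "Type s1 e1;s2 e2" for discontinuous spans
            etype, offsets = fields[1].split(" ", 1)
            fragments = [frag.split() for frag in offsets.split(";")]
            start = int(fragments[0][0])   # start of the first fragment
            end = int(fragments[-1][1])    # end of the last fragment
            text = fields[2] if len(fields) > 2 else ""
            entities.append(Entity(ann_id, etype, start, end, text))
        elif ann_id.startswith("R"):
            # "Type Role1:T1 Role2:T2"; role names depend on the brat config
            rtype, arg1, arg2 = fields[1].split()
            relations.append(
                Relation(ann_id, rtype,
                         arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


# Hypothetical sample content, for illustration only.
SAMPLE = (
    "T1\tOrgan 0 5\theart\n"
    "T2\tManifestation 6 13\tfailure\n"
    "R1\tFinding Arg1:T1 Arg2:T2\n"
)

entities, relations = parse_ann(SAMPLE)
```

Each annotator's files can be loaded in this way and the resulting entity lists compared when measuring agreement.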
Annotations in the corpus should reflect the instructions provided in the guidelines as closely as possible, in order to ensure that the annotations are of a high quality. A standard means of providing evidence regarding the reliability of annotations in a corpus is to calculate a statistic known as the inter-annotator agreement (IAA). IAA provides assurance that different annotators can produce the same annotations when working independently. There are several different methods of calculating IAA, which can be influenced by the exact nature of the annotation task. We use the measures of precision, recall and F-measure to indicate the level of inter-annotator reliability (Hripcsak & Rothschild, 2005). In order to carry out such calculations, one set of annotations is considered as a gold standard and the total number of correct entities is the total number of entities annotated by this annotator.
Precision is the proportion of the second annotator's annotations that match the first annotator's assumed gold standard. It is calculated as follows, where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively:

$$P = \frac{TP}{TP + FP}$$
Recall is the proportion of the gold standard annotations that were also recognised by the second annotator. It is calculated as follows:

$$R = \frac{TP}{TP + FN}$$
F-score is the harmonic mean of precision and recall:

$$F = \frac{2 \cdot P \cdot R}{P + R}$$
We have calculated separate IAA scores for the discharge summaries and the scientific articles. Table 3 summarises agreement rates for term annotation in the discharge summaries, showing results for both individual entity types and macro-averaged scores over all entity types. Relaxed matching criteria were employed, such that annotations added by the two annotators were considered as a match if their spans overlapped. In comparison to related efforts, the IAA rates shown in Table 3 are high. However, it should be noted that the number of targeted classes and relations in our corpus is small and focused, compared to other related corpora.
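The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used to produce the reported figures. Annotations are represented as (type, start, end) tuples, spans are compared only within the same entity type, a span counts as matched if it overlaps at least one span from the other annotator, and the macro-average is taken as the unweighted mean of the per-type scores. The function names, entity labels and example data are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def prf(gold, pred):
    """Relaxed-match precision, recall and F-score for lists of (start, end)."""
    # Second-annotator spans that overlap some gold span (for precision).
    matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    # Gold spans that overlap some second-annotator span (for recall).
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def iaa_by_type(gold, pred):
    """Per-type scores plus their macro-average over all entity types."""
    types = sorted({t for t, _, _ in gold} | {t for t, _, _ in pred})
    scores = {}
    for etype in types:
        g = [(s, e) for t, s, e in gold if t == etype]
        p = [(s, e) for t, s, e in pred if t == etype]
        scores[etype] = prf(g, p)
    if not scores:
        return scores, (0.0, 0.0, 0.0)
    macro = tuple(
        sum(score[i] for score in scores.values()) / len(scores)
        for i in range(3)
    )
    return scores, macro


# Hypothetical annotations: the first annotator is treated as gold standard.
gold = [("Organ", 0, 5), ("Organ", 20, 26), ("Manifestation", 6, 13)]
pred = [("Organ", 0, 4), ("Manifestation", 6, 13), ("Manifestation", 30, 35)]

scores, macro = iaa_by_type(gold, pred)
for etype, (p, r, f) in scores.items():
    print(f"{etype}: P={p:.2f} R={r:.2f} F={f:.2f}")
print("macro: P={:.2f} R={:.2f} F={:.2f}".format(*macro))
```

Under relaxed matching, one span may overlap several spans from the other annotator, so matches are counted separately from each annotator's side. Exact matching would instead require identical start and end offsets.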
Agreement statistics for scientific articles are shown in Table 4. Agreement is somewhat lower than for discharge summaries, which could be due to the fact that the annotators (doctors) are more used to dealing with discharge summaries in their day-to-day work, and so are more accustomed to locating information in this type of text. Scientific articles are much longer and generally include more complex language, ideas and analyses, which may require more than one reading to fully comprehend the information within them. Table 5 shows the agreement rates for relation annotation in the discharge summaries. The agreement rates for relationships are relatively high. This can partly be explained by the deep domain knowledge possessed by the annotators and partly by the fact that the relationships to be identified were relatively simple, linking only two pre-annotated entities.

Finding
This relationship links the organ to the manifestation or abnormal variation that is observed during the diagnosis process.

Negate
This is a one-way relation that links a negation attribute (polarity clue) to the condition it negates.