Tuning HeidelTime for identifying time expressions in clinical texts in English and French

We present work on tuning the HeidelTime system to identify time expressions in clinical texts in English and French. Most of the method consists in enriching and adapting linguistic resources to identify Timex3 clinical expressions and to normalize them. The adapted versions were tested on the i2b2/VA 2012 corpus for English and on a collection of clinical texts for French, which were annotated for the purpose of this study. We achieve a 0.8500 F-measure on the recognition and normalization of temporal expressions in English, and up to 0.9431 in French. Future work will aim to improve and consolidate these results.


Introduction
Unstructured narrative texts place heavy demands on automatic methods for accessing, formalizing and organizing the information they contain. The first step is to index the documents in order to detect basic facts, which then support more sophisticated processing (e.g., information extraction, question answering, visualization, or textual entailment). We are mostly interested in indexing documents from the medical field. We distinguish two kinds of indexing: conceptual and contextual.
Conceptual indexing consists in finding the mentions of notions, terms or concepts contained in documents. It traditionally relies on terminological resources, such as MeSH (NLM, 2001), SNOMED International (Côté et al., 1993), SNOMED CT (Wang et al., 2002), etc. The process is dedicated to the recognition of these terms and of their variants in documents (Nadkarni et al., 2001; Mercer and Di Marco, 2004; Bashyam and Taira, 2006; Schulz and Hahn, 2000; Davis et al., 2006).
The purpose of contextual indexing is to go further and to provide a more fine-grained annotation of documents. For this, additional information may be searched in documents, such as polarity, certainty, aspect or temporality related to the concepts. While conceptual indexing extracts and provides factual information, contextual indexing aims to describe these facts in more detail. For instance, when processing clinical records, the medical facts related to a given patient can be augmented with the associated contextual information, such as in these examples:
(1) Patient has the stomach aches.
(3) After taking this medication, patient started to have the stomach aches.
(4) Two weeks ago, patient experienced the stomach aches.
In example (1), the information is purely factual, while it is negated in example (2). Example (3) also conveys aspectual information (the medical problem has started). In examples (4) and (5), medical events are positioned in time: relative (two weeks ago) and absolute (in January 2014). The medical history of a patient can thus become more precise and detailed thanks to such contextual information. In this way, factual information related to the stomach aches of the patient receives additional descriptions which make each occurrence different and non-redundant. Note that previous I2B2 contests addressed information extraction tasks related to different kinds of contextual information.
Temporality has become an important research field in NLP and several challenges have addressed this task: ACE (ACE challenge, 2004), SemEval (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013), I2B2 2012 (Sun et al., 2013). We propose to continue working on the extraction of temporal information related to medical events. This kind of study relies on several important tasks when processing narrative documents: identification and normalization of linguistic expressions that are indicative of temporality (Verhagen et al., 2007; Chang and Manning, 2012; Strötgen and Gertz, 2012; Kessler et al., 2012), and their modeling and chaining (Batal et al., 2009; Moskovitch and Shahar, 2009; Sun et al., 2013; Grouin et al., 2013). The identification of temporal expressions provides basic knowledge for other tasks that process temporal information. Existing automatic systems such as HeidelTime (Strötgen and Gertz, 2012) or SUTIME (Chang and Manning, 2012) rely on rule-based approaches, which makes them adaptable to new data and areas. During a preliminary study, we tested several such systems for identification of temporal relations and found that HeidelTime has the best combination of performance and adaptability. We propose to adapt this automatic system and to test it on clinical documents in two languages (English and French).
In the remainder of this paper, we introduce the corpora (Section 2) and methods (Section 3). We then describe and discuss the obtained results (Section 4.2) and conclude (Section 5).

Material
Corpora composed of training and test sets are the main material we work with. The corpora are in two languages, English and French, and have comparable sizes. All the processed corpora are de-identified. The English corpora were built for the I2B2 2012 challenge (Sun et al., 2013). The training corpus consists of 190 clinical records and the test corpus of 120 records. The reference data contain annotations of temporal expressions according to the Timex3 guidelines: date, duration, frequency and time. The French corpora were built for the purpose of this study. The clinical documents come from a French hospital. The training corpus consists of 182 clinical records and the test corpus of 120 records. 25 documents from the test set are annotated to provide the reference data for evaluation.

Method
HeidelTime is a cross-domain temporal tagger that extracts temporal expressions from documents and normalizes them according to the Timex3 annotation standard, which is part of the markup language TimeML. It is a rule-based system. Because the source code and the resources (patterns, normalization information, and rules) are strictly separated, it is possible to develop and implement resources for additional languages and areas using HeidelTime's rule syntax. HeidelTime is provided with modules for processing documents in several languages, e.g. French (Moriceau and Tannier, 2014). For English, several versions of the system exist, such as general-language English and scientific English.
HeidelTime uses different normalization strategies depending on the domain of the documents to be processed: news, narratives (e.g. Wikipedia articles), colloquial (e.g. SMS, tweets), and scientific (e.g. biomedical studies). The news strategy makes it possible to set the document creation date. This date is important for computing and normalizing relative dates, such as two weeks ago or 5 days later, which require a reference point in time: if the document creation date is 2012/03/24, two weeks ago becomes 2012/03/10.
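The role of the reference date can be illustrated with a minimal Python sketch. It is not HeidelTime's implementation: the function name `resolve_relative`, the number-word table and the narrow pattern it accepts are assumptions made for illustration only.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Hypothetical helper (not part of HeidelTime): resolves expressions of the
# form "<N> days/weeks ago" or "<N> days/weeks later" against a reference date.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
UNIT_DAYS = {"day": 1, "week": 7}
PATTERN = re.compile(r"^(\w+)\s+(day|week)s?\s+(ago|later)$", re.IGNORECASE)


def resolve_relative(expression: str, reference: date) -> Optional[str]:
    """Return an ISO date string, or None if the expression is not covered."""
    match = PATTERN.match(expression.strip())
    if match is None:
        return None
    quantity, unit, direction = (part.lower() for part in match.groups())
    if quantity.isdigit():
        count = int(quantity)
    elif quantity in NUMBER_WORDS:
        count = NUMBER_WORDS[quantity]
    else:
        return None
    offset = timedelta(days=count * UNIT_DAYS[unit])
    # "ago" moves backwards from the reference date, "later" moves forwards.
    resolved = reference - offset if direction == "ago" else reference + offset
    return resolved.isoformat()
```

With the reference date 2012/03/24, `resolve_relative("two weeks ago", date(2012, 3, 24))` returns `"2012-03-10"`, which matches the example given in the text.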
Our method consists of three steps: tuning HeidelTime to clinical data in English and French (Section 3.1), evaluation of the results (Section 3.2), and exploitation of the computed data for the visualization of the medical events (Section 3.3).

Tuning HeidelTime
While HeidelTime offers good coverage of the temporal expressions used in general-language documents, it needs to be adapted to specialized areas. We propose to tune this tool to medical documents. The tuning is done in two languages (English and French). Tuning involves three aspects:
1. The most important adaptation is the enrichment and encoding of linguistic expressions specific to medical, and especially clinical, temporal expressions, such as post-operative day #, b.i.d. meaning twice a day, day of life, etc.
2. The admission date is considered as the reference or starting point for computing relative dates, such as 2 days later. To identify the admission date, a specific preprocessing step is applied in order to detect it within the documents.
3. Additional normalizations of the temporal expressions are performed: durations are normalized to approximate numerical values rather than to the undefined 'X'-value, and some durations and frequencies are computed externally because of limitations in HeidelTime's internal arithmetic processor.
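The preprocessing step that locates the admission date could look like the following Python sketch. The header format ("Admission Date :" followed by a YYYY-MM-DD date) and the function name `find_admission_date` are assumptions for illustration; the actual record layout and the detection rules used in this work are not specified here.

```python
import re
from datetime import date
from typing import Optional

# Assumption: the record carries a header such as "Admission Date : 2012-03-24",
# possibly with the date on the following line.
ADMISSION_RE = re.compile(
    r"admission\s+date\s*:\s*(\d{4})-(\d{2})-(\d{2})", re.IGNORECASE
)


def find_admission_date(record_text: str) -> Optional[date]:
    """Return the first admission date found in the record, or None."""
    match = ADMISSION_RE.search(record_text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a valid calendar date.
        return None
```

The returned date can then serve as the reference point for relative expressions such as 2 days later, in place of the document creation date.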

Evaluating the results
HeidelTime is tuned on the training set and evaluated on the test set. The generated results are evaluated against the reference data with:
• precision P: the number of relevant temporal expressions extracted divided by the total number of temporal expressions extracted;
• recall R: the number of relevant temporal expressions extracted divided by the number of expected temporal expressions;
• APR: the arithmetic average of the precision and recall values, $\mathrm{APR} = \frac{P + R}{2}$;
• F-measure F: the harmonic mean of the precision and recall values, $F = \frac{2 \cdot P \cdot R}{P + R}$.
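As a minimal sketch, these measures can be computed from three counts. The function name `evaluate` and its parameters are illustrative assumptions, and the F-measure uses the standard harmonic-mean definition 2PR/(P+R).

```python
from typing import Dict


def evaluate(n_correct: int, n_extracted: int, n_expected: int) -> Dict[str, float]:
    """Compute precision, recall, APR and F-measure from raw counts.

    n_correct:   relevant temporal expressions extracted by the system
    n_extracted: all temporal expressions extracted by the system
    n_expected:  temporal expressions in the reference annotation
    """
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_expected if n_expected else 0.0
    apr = (precision + recall) / 2  # arithmetic mean of P and R
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return {
        "precision": precision,
        "recall": recall,
        "apr": apr,
        "f_measure": f_measure,
    }
```

The harmonic mean is never larger than the arithmetic mean, so F is at most equal to APR, and the two coincide only when precision and recall are equal.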

Exploiting the results
To assess the usefulness of the extracted temporal information, we use it to build a timeline. For this, the medical events are associated with normalized, absolute temporal information. This temporal information is then used to order and visualize the medical events.
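The ordering step can be sketched in a few lines of Python. The `Event` structure and the `build_timeline` function are hypothetical names introduced for illustration; they assume that each medical event has already been anchored to an absolute, normalized date.

```python
from datetime import date
from typing import List, NamedTuple


class Event(NamedTuple):
    """A medical event anchored to a normalized, absolute date (hypothetical structure)."""
    label: str
    when: date


def build_timeline(events: List[Event]) -> List[Event]:
    """Return the events in chronological order.

    sorted() is stable, so events sharing the same date keep their input order.
    """
    return sorted(events, key=lambda event: event.when)
```

For example, events labeled admission, surgery and discharge, anchored to three successive dates, are returned in that order whatever their order in the input list.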

Experiments
The experiments are as follows. Data in English and French are processed with the available versions of HeidelTime: two existing versions (general language and scientific language) and the medical version created in this study. The results obtained are evaluated against the reference data.

Results
We added several new rules to HeidelTime (164 in English and 47 in French) to adapt the recognition of temporal expressions to medical documents. Some cases are difficult to annotate. For instance, it is hard to decide whether some expressions denote dates or durations. An utterance like 2 years ago (il y a 2 ans) is considered to indicate a date. An utterance like since 2010 (depuis 2010) is considered to indicate a duration, although the beginning of the duration interval also marks the beginning of the process and its date. Another complex situation arises with relative dates:
• as already mentioned, dates like 2 years ago (il y a 2 ans) have to be normalized according to the reference time point;
• a more complex situation arises with expressions like the day of the surgery (le jour de l'opération) or at the end of the treatment by antibiotics (à la fin de l'antibiothérapie), for which the other medical event must first be anchored in time before the date in question can be defined.
In Table 1, we present the evaluation results for English. On the training corpus, with the general language version and the scientific version of HeidelTime, we obtain an F-measure around 0.66: precision (0.77 to 0.79) is higher than recall (0.56). The values of F-measure and APR are identical. The version we adapted to the medical language provides better results for all the evaluation measures used: F-measure reaches 0.84, with precision up to 0.85 and recall 0.84. This is a clear improvement of the automatic tool, which indicates that specialized areas, such as the medical area, do use a specific lexicon and specific constructions. Interestingly, on the test corpus, the results decrease for the general language and scientific versions of HeidelTime, but increase for the medical version of HeidelTime, with F-measure 0.85. During the I2B2 competition, the maximal F-measure obtained was 0.91. With F-measure 0.84, our system was ranked 10/14 on the English data. Our current results improve on that earlier score.
In Table 2, we present the results obtained on the French test corpus (26 documents). Two versions of HeidelTime are applied: the general language version, which is already available, and the medical version, which has been developed in the present work. We can observe that the adapted version is better suited to the content of clinical documents and improves the F-measure values by 3 points, reaching up to 0.94. The main limitation of the system is the incomplete coverage of the linguistic expressions (e.g. au cours de, mensuel (during, monthly)). Among the current false positives, we find ratios (2/10 is tagged as a date although it is a lab result), polysemous expressions (Juillet in rue du 14 Juillet (14 Juillet street)), and segmentation errors (few days detected instead of the next few days). These limitations will be addressed in future work.

In Figure 1, we propose a visualization of the temporal data, which makes use of the extracted temporal information. In this way, the medical events can be ordered thanks to their temporal anchors, which provides a very useful presentation of information in clinical practice (Hsu et al., 2012). The visualization of unspecified expressions (e.g. later, sooner) is being studied, although such expressions often seem to occur with more specific expressions (e.g. later that day).

Conclusion
HeidelTime, an existing tool for extracting and normalizing temporal information, has been adapted to medical documents in two languages (English and French). Its evaluation against the reference data indicates that tuning it to medical documents is effective: we reach F-measure 0.85 in English and up to 0.94 in French. More complete data in French are being annotated, which will allow a more complete evaluation of the tuned version. We plan to make the tuned version of HeidelTime freely available. Automatically extracted temporal information can be exploited for the visualization of the clinical data related to patients. In addition, these data can be combined with other kinds of contextual information (polarity, uncertainty) to provide a more exhaustive picture of the medical history of patients.