Automatic diagnosis of understanding of medical words

Within the medical field, very specialized terms are commonly used, while their understanding by laymen is not always successful. We propose to study the understandability of medical words by laymen. Three annotators are involved in the creation of the reference data used for training and testing. The features of the words may be linguistic (i.e., number of characters, syllables, number of morphological bases and affixes) and extra-linguistic (i.e., their presence in a reference lexicon, frequency on a search engine). The automatic categorization results show between 0.806 and 0.947 F-measure values. It appears that several features and their combinations are relevant for the analysis of understandability (i.e., syntactic categories, presence in reference lexica, frequency on the general search engine, final substring).


Introduction
The medical field has deeply penetrated our daily life, whether through personal or family health conditions, TV and radio broadcasts, or the reading of novels and journals. Nevertheless, the availability of this kind of information does not guarantee its correct understanding, especially by laymen, such as patients. The medical field indeed has a specific terminology (e.g., abdominoplasty, hepatic, dermabrasion or hepatoduodenostomy) commonly used by medical professionals. This fact has been highlighted in several studies dedicated for instance to the understanding of pharmaceutical labels (Patel et al., 2002), of information provided by websites (Rudd et al., 1999; Berland et al., 2001; McCray, 2005; Oregon Evidence-based Practice Center, 2008), and more generally to the understanding between patients and medical doctors (AMA, 1999; McCray, 2005; Jucks and Bromme, 2007; Tran et al., 2009).
We propose to study the understanding of words used in the medical field, which is the first step towards the simplification of texts. Indeed, before simplification can be performed, it is necessary to know which textual units may be difficult to understand and should be simplified. We work with French data provided by an existing medical terminology. In the remainder, we first present some related work, especially from specialized fields (section 2). We then introduce the linguistic data (section 4) and the methodology (section 5) we propose to test. We present and discuss the results (section 6), and conclude with some directions for future work (section 7).

Studying the understanding of words
The understanding (of words) may be seen as a scale going from I can understand to I cannot understand, and containing one or more intermediate positions (e.g., I am not sure, I have seen it before but do not remember the meaning, I do not know but can interpret). Notice that it is also related to the ability to provide a correct explanation and use of words. As we explain later, we consider words out of context and use a three-position scale. More generally, understanding is a complex notion closely linked to several other notions studied in different research fields. For instance, lexical complexity is studied in linguistics and gives clues on the lexical processes involved, which may impact word understanding (section 2.1). Work in psycholinguistics is often oriented toward the study of word opacity and the mental processes involved in word understanding (Jarema et al., 1999; Libben et al., 2003). Readability provides a set of methods to compute and quantify the understandability of words (section 2.3). The specificity of words to specialized areas is another way to capture their understandability (section 2.2). Finally, lexical simplification aims at providing simpler words to be used in a given context (section 2.3).

Linguistics
In linguistics, the question is closely related to lexical complexity and compounding. It has indeed been observed that at least five factors, linguistic and extra-linguistic, may be involved in the semantic complexity of compounds. One factor is related to the knowledge of the components of the complex words. Formal (how the words, such as aérenchyme, can be segmented) and semantic (how the words can be understood and used) points of view can be distinguished. A second factor is that complexity is also due to the variety of morphological patterns and relations among the components. For instance, érythrocyte (erythrocyte) and ovocyte (ovocyte) instantiate the [N1N2] pattern in which N2 (cyte) can be seen as a constant element (Booij, 2010), although the relations between N1 and N2 are not of the same type in these two compounds: in érythrocyte, N1 érythr(o) denotes a property of N2 (color), while in ovocyte, N1 ovo (egg) corresponds to a specific development stage of female cells. Another factor appears when some components are polysemous, within a given field (i.e., the medical field) or across fields. For instance, aér(o) does not always convey the same meaning: in aérocèle, aér- denotes 'air' (tumefaction (cèle) formed by an air infiltration), but not in aérasthénie, which refers to an asthenia (psychic disorder) observable among jet pilots. Yet another factor may be due to the difference in the order of components, according to whether the compounding is standard (in French, the main semantic element is then on the left, such as in pneu neige (snow tyre), which is fundamentally a pneu (tyre)) or neoclassical (in French, the main semantic element is then on the right, such as érythrocyte, which is a kind of cyte (cell/corpuscle) with red color). It is indeed complicated for a user without medical training to correctly interpret a word that he does not know and for which he cannot reuse the existing standard compounding patterns.
This difficulty is common to all Romance languages (Iacobini, 2003), but not to Germanic languages (Lüdeling et al., 2002). Closely related is the fact that with neoclassical compounds, a given component may change its place according to the global semantics of the compounds, such as path in pathology, polyneuropathe, cardiopathy. Finally, the formal similarity between some derivation processes (such as the derivation in -oide, as in lipoid) and neoclassical compounding (such as -ase in lipase), which apply completely different interpretation patterns (Iacobini, 1997; Amiot and Dal, 2005), can also make the understanding more difficult.

Terminology
In the terminology field, the automatic identification of difficulty of terms and words remains implicit, while this notion is fundamental in terminology (Wüster, 1981;Cabré and Estopà, 2002;Cabré, 2000). The specificity of terms to a given field is usually studied. The notion of understandability can be derived from it. Such studies can be used for filtering the terms extracted from specialized corpora (Korkontzelos et al., 2008). The features exploited include for instance the presence and the specificity of pivot words (Drouin and Langlais, 2006), the neighborhood of the term in corpus or the diversity of its components computed with statistical measures such as C-Value or PageRank (Daille, 1995;Frantzi et al., 1997;Maynard and Ananiadou, 2000). Another possibility is to check whether lexical units occur within reference terminologies and, if they do, they are considered to convey specialized meaning (Elhadad and Sutaria, 2007).

NLP studies
The application of readability measures is another way to evaluate the complexity of words and terms. Among these measures, it is possible to distinguish classical readability measures and computational readability measures (François, 2011). Classical measures usually rely on the number of letters and/or syllables a word contains and on linear regression models (Flesch, 1948; Gunning, 1973), while computational readability measures may involve vector models and a great variety of features, among which the following have been used to process biomedical documents and words: combination of classical readability formulas with medical terminologies (Kokkinakis and Toporowska Gronostaj, 2006); n-grams of characters (Poprat et al., 2006), manually (Zheng et al., 2002) or automatically (Borst et al., 2008) defined weight of terms, stylistic (Grabar et al., 2007) or discursive (Goeuriot et al., 2007) features, lexicon (Miller et al., 2007), morphological features (Chmielik and Grabar, 2011), combinations of different features (Wang, 2006; Zeng-Treiler et al., 2007; Leroy et al., 2008). A specific task was dedicated to lexical simplification within the SemEval challenge in 2012 1. Given a short input text and a target word in English, and given several English substitutes for the target word that fit the context, the goal was to rank these substitutes according to how "simple" they are (Specia et al., 2012). The participants applied rule-based and/or machine learning systems.
Combinations of various features have been used: lexicon from spoken corpus and Wikipedia, Google n-grams, WordNet (Sinha, 2012); word length, number of syllables, latent semantic analysis, mutual information and word frequency (Jauhar and Specia, 2012); Wikipedia frequency, word length, n-grams of characters and of words, random indexing and syntactic complexity of documents (Johannsen et al., 2012); n-grams and frequency from Wikipedia, Google n-grams (Ligozat et al., 2012); WordNet and word frequency (Amoia and Romanelli, 2012).

Aims of the present study
We propose to investigate how the understandability of French medical words can be diagnosed with NLP methods. We rely on the reference annotations performed by French speakers without medical training, which we associate with patients. The experiments performed rely on machine learning algorithms and a set of 24 features. The medical words studied are provided by an existing medical terminology.

Linguistic data and their preparation
The linguistic data are obtained from the medical terminology Snomed International (Côté, 1996). The aim of this terminology is to describe the whole medical field. It contains 151,104 medical terms structured into eleven semantic axes such as disorders and abnormalities, procedures, chemical products, living organisms, anatomy, social status, etc. We keep here five axes related to the main medical notions (disorders, abnormalities, procedures, functions, anatomy). The objective is to exclude axes such as chemical products (trisulfure d'hydrogène (hydrogen sulfide)) and living organisms (Sapromyces, Acholeplasma laidlawii), which group very specific terms hardly known by laymen. The 104,649 selected terms are tokenized and segmented into words (or tokens) to obtain 29,641 unique words: trisulfure d'hydrogène gives three words (trisulfure, de, hydrogène). This dataset contains compounds (abdominoplastie (abdominoplasty), dermabrasion (dermabrasion)), constructed (cardiaque (cardiac), acineux (acinic), lipoïde (lipoid)) and simple (acné (acne), fragment (fragment)) words. These data are annotated by three speakers aged 25 to 40, without medical training but with a linguistic background. We expect the annotators to represent the average knowledge of medical words amongst the population as a whole. The annotators are presented with a list of terms and asked to assign each word to one of three categories: (1) I can understand the word; (2) I am not sure about the meaning of the word; (3) I cannot understand the word. The assumption is that words that are not understandable by the annotators are also difficult to understand for patients. These manual annotations constitute the reference data (Table 1).
1 http://www.cs.york.ac.uk/semeval-2012/
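As an illustration of the segmentation step, the following minimal sketch splits terms into words and collects the unique ones. Only the expansion of d' into de is attested by the example above; the function names, the lowercasing and the whitespace-based splitting are assumptions of this sketch, not a description of the tokenizer actually used.

```python
# Expansion of elided forms; only "d'" -> "de" is attested in the text.
ELISIONS = {"d'": "de"}

def tokenize_term(term):
    """Split a term on whitespace and expand known elided forms."""
    tokens = []
    # Normalize the typographic apostrophe so both spellings are handled.
    for chunk in term.lower().replace("\u2019", "'").split():
        head = chunk[:2]
        if head in ELISIONS and len(chunk) > 2:
            tokens.extend([ELISIONS[head], chunk[2:]])
        else:
            tokens.append(chunk)
    return tokens

def unique_words(terms):
    """Collect the set of distinct words over a list of terms."""
    return {word for term in terms for word in tokenize_term(term)}

print(tokenize_term("trisulfure d'hydrogène"))
```

Applied to the whole list of selected terms, a function such as unique_words would yield the vocabulary submitted to the annotators.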

Methodology
The proposed method has two aspects: generation of the features associated with the analyzed words and a machine learning system. The main research question is whether NLP methods can distinguish between understandable and non-understandable medical words and whether they can diagnose these two categories.

Generation of the features
We exploit 24 linguistic and extra-linguistic features related to general and specialized languages. The features are computed automatically, and can be grouped into ten classes:
Syntactic categories. Syntactic categories and lemmas are computed by TreeTagger (Schmid, 1994) and then checked by Flemm (Namer, 2000). The syntactic categories are assigned to words within the context of their terms. If a given word receives more than one category, the most frequent one is kept as the feature. Among the main categories we find for instance nouns, adjectives, proper names, verbs and abbreviations.
Presence of words in reference lexica. We exploit two reference lexica of the French language: TLFi and lexique.org. TLFi is a dictionary of the French language covering the 19th and 20th centuries. It contains almost 100,000 entries. lexique.org is a lexicon created for psycholinguistic experiments. It contains over 135,000 entries, including inflectional forms of verbs, adjectives and nouns, and almost 35,000 lemmas.
Frequency of words through a non-specialized search engine. For each word, we query the Google search engine to obtain its frequency as attested on the web.
Frequency of words in the medical terminology. We also compute the frequency of words in the medical terminology Snomed International.
Number and types of semantic categories associated with words. We exploit the information on the semantic categories of Snomed International.
Length of words in number of their characters and syllables. For each word, we compute the number of its characters and syllables.
Number of bases and affixes. Each lemma is analyzed by the morphological analyzer Dérif (Namer and Zweigenbaum, 2004), adapted to the treatment of medical words. It decomposes lemmas into the bases and affixes known in its database and also provides a semantic explanation of the analyzed lexemes. We exploit the morphological decomposition information (number of affixes and bases).
Initial and final substrings of the words. We compute the initial and final substrings of different lengths, from three to five characters.
Number and percentage of consonants, vowels and other characters. We compute the number and the percentage of consonants, vowels and other characters (i.e., hyphens, apostrophes, commas).
Classical readability scores. We apply two classical readability measures: Flesch (Flesch, 1948) and its variant Flesch-Kincaid (Kincaid et al., 1975). Such measures are typically used for evaluating the difficulty level of a text. They exploit surface characteristics of words (number of characters and/or syllables) and normalize these values with specifically designed coefficients.
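The surface features listed above (length, substrings, character classes, classical readability scores) can be sketched as follows. This is a minimal illustration under stated assumptions: the vowel inventory, the approximation of syllables by runs of vowel letters, and the feature names are ours, and the text does not specify how the readability formulas are applied to isolated words. The coefficients are those of the standard Flesch and Flesch-Kincaid formulas.

```python
# Vowel inventory for French orthography (an assumption of this sketch).
VOWELS = set("aeiouyàâäéèêëîïôöùûüÿœæ")

def count_syllables(word):
    """Approximate syllables as maximal runs of vowel letters (heuristic)."""
    count, previous_is_vowel = 0, False
    for char in word.lower():
        is_vowel = char in VOWELS
        if is_vowel and not previous_is_vowel:
            count += 1
        previous_is_vowel = is_vowel
    return max(count, 1)

def surface_features(word, min_len=3, max_len=5):
    """Length, character-class counts and initial/final substrings."""
    w = word.lower()
    total = len(w) or 1  # avoid division by zero on empty input
    letters = [c for c in w if c.isalpha()]
    n_vowels = sum(c in VOWELS for c in letters)
    n_consonants = len(letters) - n_vowels
    n_other = len(w) - len(letters)  # hyphens, apostrophes, digits, etc.
    features = {
        "n_chars": len(w),
        "n_syllables": count_syllables(w),
        "n_vowels": n_vowels,
        "n_consonants": n_consonants,
        "n_other": n_other,
        "pct_vowels": n_vowels / total,
        "pct_consonants": n_consonants / total,
        "pct_other": n_other / total,
    }
    for k in range(min_len, max_len + 1):
        features[f"initial_{k}"] = w[:k]
        features[f"final_{k}"] = w[-k:]
    return features

def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch reading ease formula."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

print(surface_features("lipase"))
```

One possible way to score an isolated word, again an assumption, is to treat it as a one-word, one-sentence text, e.g. flesch_reading_ease(1, 1, count_syllables(word)).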

Machine learning system
The machine learning algorithms are used to study whether they can distinguish between words understandable and non-understandable by laymen and to study the importance of various features for the task. The functioning of machine learning algorithms is based on a set of positive and negative examples of the data to be processed, which have to be described with suitable features such as those presented above. The algorithms can then detect the regularities within the training dataset to generate a model, and apply the generated model to process new unseen data. We apply various algorithms available within the WEKA (Witten and Frank, 2005) platform.
The annotations provided by the three annotators constitute our reference data. We use five reference datasets in total (Table 1): 3 sets of separate annotations provided by the three annotators (29,641 words each); 1 unanimity set, on which all the annotators agree (n=22,925); 1 majority set, for which we can compute the majority agreement (n=28,763). By definition, the last two datasets should be more coherent and less ambiguous, because some ambiguities have been resolved by unanimity or by majority vote.
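The construction of the unanimity and majority sets can be sketched as below. The data structure (a dictionary from each word to the three labels 1, 2 or 3) and the function name are assumptions made for illustration. Under this reading, a word receiving three different labels has no majority and is left out, which would presumably account for the 878 words (29,641 − 28,763) absent from the majority set.

```python
from collections import Counter

def build_reference_sets(annotations):
    """Derive unanimity and majority labels.

    annotations: dict mapping each word to a tuple of labels,
    one per annotator (1 = can understand, 2 = not sure, 3 = cannot).
    """
    unanimity, majority = {}, {}
    for word, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count == len(labels):      # all annotators agree
            unanimity[word] = label
        if count > len(labels) / 2:   # strict majority of annotators
            majority[word] = label
    return unanimity, majority

# Hypothetical annotations, for illustration only.
toy = {"acné": (1, 1, 1), "clivage": (1, 3, 3), "douve": (1, 2, 3)}
print(build_reference_sets(toy))
```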

Evaluation
The inter-annotator agreement is computed with Cohen's Kappa (Cohen, 1960), applied to pairs of annotators, whose values are then averaged to obtain a single value; and with Fleiss' Kappa (Fleiss and Cohen, 1973), suitable for data provided by more than two annotators. The scores are interpreted as follows (Landis and Koch, 1977): substantial agreement between 0.61 and 0.80, almost perfect agreement between 0.81 and 1.00.
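Both agreement measures can be computed directly from the label sequences, as in the following sketch. It follows the standard definitions of Cohen's and Fleiss' Kappa; the function names and the input layout (one list of labels per annotator, aligned on the same words) are assumptions, and the toy labels are not taken from the reference data.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's Kappa for two aligned label sequences."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    # Undefined (division by zero) if both annotators use a single label.
    return (p_observed - p_expected) / (1 - p_expected)

def mean_pairwise_cohen(annotators):
    """Average Cohen's Kappa over all pairs of annotators."""
    pairs = list(combinations(annotators, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

def fleiss_kappa(annotators):
    """Fleiss' Kappa for any number of annotators rating every item."""
    n_raters = len(annotators)
    items = list(zip(*annotators))
    totals = Counter()
    p_bar = 0.0
    for labels in items:
        counts = Counter(labels)
        totals.update(counts)
        agreeing = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing / (n_raters * (n_raters - 1))
    p_bar /= len(items)
    p_expected = sum((t / (len(items) * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_expected) / (1 - p_expected)

# Hypothetical labels for three annotators over five words.
a1 = [1, 3, 3, 1, 2]
a2 = [1, 3, 3, 3, 2]
a3 = [1, 3, 1, 3, 3]
print(mean_pairwise_cohen([a1, a2, a3]), fleiss_kappa([a1, a2, a3]))
```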
With machine learning, we perform a ten-fold cross-validation, which means that the evaluation test is performed ten times on different randomly generated test sets (1/10 of the whole dataset), while the remaining 9/10 of the dataset is used for training the algorithm and creating the model. In this way, each word is used during the test step. The success of the applied algorithms is evaluated with three classical measures: recall (R), precision (P) and F-measure (F). In the perspective of our work, these measures allow evaluating the suitability of the methodology for distinguishing between understandable and non-understandable words, and the relevance of the chosen features.
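The evaluation protocol can be sketched as follows: a generator producing the ten train/test partitions, and a function computing P, R and F for one category. This is an illustrative reimplementation, not the WEKA procedure itself; the function names, the fixed random seed and the per-category formulation are assumptions.

```python
import random

def ten_fold_indices(n_items, n_folds=10, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def precision_recall_f(gold, predicted, label):
    """Precision, recall and F-measure for one category."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical gold and predicted labels, for illustration only.
print(precision_recall_f([1, 1, 3, 3], [1, 3, 3, 3], label=3))
```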
The baseline corresponds to the assignment of words to the biggest category, e.g., I cannot understand, which represents 66 to 74%, according to datasets. We can also compute the gain, which is the effective improvement of performance P given the baseline BL (Rittman, 2008): $\frac{P - BL}{1 - BL}$.
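The baseline and the gain can be illustrated with the short sketch below, which reads the gain as (P − BL) / (1 − BL), i.e. the share of the remaining margin above the baseline that is actually recovered. The function names and the numerical values are hypothetical and are not results from this study.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the biggest category and the share of items it covers."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

def gain(performance, baseline):
    """Improvement over the baseline, normalized by the remaining margin."""
    return (performance - baseline) / (1 - baseline)

# Hypothetical labels and performance, for illustration only.
label, bl = majority_baseline([3, 3, 3, 1, 2])
print(label, bl, gain(0.9, bl))
```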
Automatic analysis of understandability of medical words: Results and Discussion
We address the following aspects: annotations (inter-annotator agreement, assignment of words to three categories), quantitative results provided by the machine learning algorithms, impact of the individual features on the distinction between categories, and usefulness of the method.

Annotations and inter-annotator agreement
The time needed for performing the manual reference annotations depends on annotators and ranges from 3 to 6 weeks. The annotation results presented in Table 1 indicate that annotators 1 and 2 often provide similar results on their understanding of the medical words, while for the third annotator the task appears to be more difficult, as he indicates globally a higher number of non-understandable words. The non-understandable words are the most frequent for all annotators and cover 66 to 70% of the whole dataset. The inter-annotator agreement is substantial: Fleiss' Kappa 0.735 and Cohen's Kappa 0.736. This is a very good result, especially when working with linguistic data, for which agreement is usually difficult to obtain. The evolution of annotations per category (Figure 1) makes it easy to distinguish between the three categories: (1) the most frequently chosen category is I cannot understand and it grows rapidly with new words; (2) the next most frequently chosen category is I can understand, although it grows more slowly; (3) the third category, which gathers the words on which the annotators show some hesitation, is very small. Given the proximity between the lines in each category, we can conclude that the annotators have similar difficulties in understanding the words from the dataset.
We tested several machine learning algorithms to discover which of them are the most suitable for the task at hand. In Table 2, with results computed on the majority dataset, we can observe that the algorithms provide similar performance (between 0.85 and 0.90 P and R). In the remainder of the paper, we present results obtained with J48 (Quinlan, 1993). Table 3 shows P, R and F values for the five datasets: three annotators, majority and unanimity datasets.
We can observe that, among the three annotators, it is easier to reproduce the annotations of the third annotator: F is then 0.040 higher than for the two other annotators. The results become even better with the majority dataset (F=0.881), and F reaches up to 0.947 on the unanimity dataset. As we expected, these last two datasets present less annotation ambiguity. The best categorization results are observed with the I can understand and I cannot understand categories, while the I am not sure category is poorly managed by machine learning algorithms. Because this category is very small, the average performance obtained on all three categories remains high.
Table 3: J48 performance obtained on five datasets (A1, A2, A3, unanimity and majority).
In Table 4, we indicate the gain obtained by J48 compared to the baseline: it ranges from 0.13 to 0.20, which is a good improvement, despite the I am not sure category, which is difficult to discriminate. We also indicate the accuracy obtained on these datasets.
Table 4: Gain obtained for F by J48 on five datasets (A1, A2, A3, unanimity and majority).

Impact of individual features on understandability of medical words
To observe the impact of individual features, we ran several iterations of experiments during which we incrementally increased the set of features: we started with one feature and then, at each iteration, we added one new feature, up to the 24 features available. We tried several random orders. The test presented here is done again on the majority dataset. The corresponding figures show the following:
• with the syntactic categories (POS-tags) alone, we obtain P and R between 0.65 and 0.7. The performance is then close to the baseline performance. Proper names and abbreviations are often associated with the non-understandable words. There is no difference between TreeTagger alone and the combination of TreeTagger with Flemm;
• the initial and final substrings have a positive impact. Among the final substrings, those with three and four characters (i.e., -omie of -tomie (meaning cut), -phie of -rraphie (meaning stitch), -émie (meaning blood)) show a positive impact, but substrings with five characters have a negative impact and the previously gained improvement is lost. We may conclude that the five-character final substrings are too specific;
• the length of words in characters has a negative impact on the categorization results. There seems to be no strong link between this feature and the understanding of words: short and long words alike may be experienced as understandable or not by the annotators;
• the presence of words in the reference lexica (TLFi and lexique.org) is beneficial to both precision and recall. We assume these lexica may represent the common lexical competence of French speakers. For this reason, words that are present in these lexica are also easier to understand;
• the frequencies of words computed through a general search engine are beneficial. Words with higher frequencies are often associated with a better understanding, although the frequency range depends on the words. For instance, coccyx (coccyx) or drain (drain) show high frequencies (1,800,000 and 175,000,000, respectively) and they indeed belong to the I can understand category. Words like colique (diarrhea) or clitoridien (clitoral) show lower frequencies (807,000 and 9,821, respectively), although they belong to the same category. On the contrary, other words with quite high frequencies, like coagulase (coagulase), clivage (cleavage) or douve (fluke) (655,000, 1,350,000 and 1,030,000, respectively) are not understood by the annotators.
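The incremental protocol described above can be sketched as a simple loop. The evaluate argument is a hypothetical callback standing for a full training and cross-validation run on the given feature subset; it, the function name and the toy scoring function are assumptions made for illustration.

```python
import random

def incremental_feature_impact(features, evaluate, seed=0):
    """Add features one at a time in random order and track the score.

    evaluate: callable taking a tuple of feature names and returning a
    score (e.g., a cross-validated F-measure); hypothetical callback.
    Returns a list of (feature, score, change_from_previous_step).
    """
    order = list(features)
    random.Random(seed).shuffle(order)
    history, active, previous = [], [], None
    for name in order:
        active.append(name)
        score = evaluate(tuple(active))
        delta = None if previous is None else score - previous
        history.append((name, score, delta))
        previous = score
    return history

# Toy scoring function standing in for a real classifier evaluation.
def toy_score(subset):
    return 0.7 + 0.05 * len(subset)

for step in incremental_feature_impact(["pos", "lexicon", "frequency"], toy_score):
    print(step)
```

Repeating the loop with several seeds corresponds to trying several random orders; a negative change at a given step signals a feature that degrades the result in that particular context.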
According to these experiments, the most efficient features include syntactic categories, presence of words in the reference lexica, frequencies of words on Google, and three- and four-character final substrings. In comparison to existing studies, such as those presented during the SemEval challenge (Specia et al., 2012), we propose to exploit a more complete set of features, several of which rely on NLP methods (e.g., syntactic tagging, morphological analysis). The syntactic tagging in particular appears to be salient for the task. In comparison to work done on general language data (Gala et al., 2013), our experiment shows better results (between 0.825 and 0.948 accuracy against 0.62 accuracy in the cited work), which indicates that specialized domains have indeed very specific words. Additional tests should be performed to obtain a more detailed picture of the impact of the features.

Usefulness of the method
We applied the proposed method to words from discharge summaries. The documents are preprocessed according to the same protocol and the words are assigned the same features as previously (section 5). The model learned on the unanimity set is applied. The results are shown in Figure 3. Among the words categorized as non-understandable (in red and underlined), we find:
• abbreviations (NIHSS, OAP, NaCl, VNI);
• technical medical terms (hypoesthésie (hypoesthesia), parésie (paresia), thrombolyse (thrombolysis), iatrogène (iatrogenic), oxygénothérapie (oxygen therapy), désaturation (desaturation));
Figure 3: Detection of non-understandable words within discharge summaries.
Notice that in other processed documents, other errors occur. For instance, misspelled words and words that lack accented characters (probleme instead of problème (problem), realise instead of réalisé (done), particularite instead of particularité (particularity)) are problematic. Another type of error may occur when technical words (e.g., prolapsus (prolapsus), paroxysme (paroxysm), tricuspide (tricuspid)) are considered as understandable. Besides, only isolated words are currently processed, which is a limitation of the current method. Complex medical terms, which convey more complex medical notions, should also be considered. Such terms may indeed change the understanding of words, as in these examples: AVC ischémique (ischemic CVA (cerebrovascular accident)), embolie pulmonaire basale droite (right basal pulmonary embolism), désaturation à 83 % (desaturation at 83%), anticoagulation curative (curative anticoagulation). In the same way, numerical values may also give rise to misunderstanding of medical information. Processing of these additional aspects (inflected and constructed forms of words, hyphenated or misspelled words, complex terms composed of several words, and numerical values) is part of future work.

Limitations of the current study
We proposed several experiments for analyzing the understandability of medical words. We tried to analyze these data from different points of view to get a more complete picture. Still, there are some limitations. These are mainly related to the linguistic data and to their preparation.
The whole set of analyzed words is large: almost 30,000 entries. The annotations may therefore show some intra-annotator inconsistencies, due for instance to the tiredness and instability of the annotators (for instance, when a given unknown morphological component is seen again and again, its meaning may be deduced by the annotator). Moreover, in our daily life, we are also confronted with medical language (our personal health or the health of family or friends, TV and radio broadcasts, various readings of newspapers and novels), so new medical notions may have been learned during the annotation period, which lasted up to four weeks. Nevertheless, the advantage of the data we have built is that the whole set is completely annotated by each annotator.
When computing the features of the words, we have favored those that are computed at the word level. In future work, it may be interesting to take into account features computed at the level of morphological components or of complex terms. The main question will be to decide how such features can be combined.
The annotators involved in the study have training in linguistics, although their relation with the medical field is weak: they have no specific health problems and no expertise in medical terminology. We expect they may represent the average level of patients with moderate health literacy. Nevertheless, the observed results may remain specific to the category of young people with linguistic training. Additional experiments are required to study this aspect better.

Conclusion and Future research
We proposed a study of words from the medical field, which are manually annotated as understandable, non-understandable and possibly understandable to laymen. The proposed approach is based on machine learning and a set of 24 features. Among the features that appear to be salient for the diagnosis of understandable words, we find for instance the presence of words in the reference lexica, their syntactic categories, their final substring, and their frequencies on the web. Several features and their combinations can be distinguished, which shows that the understandability of words is a complex notion involving several linguistic and extra-linguistic criteria.
Avenues for future research include for instance the exploitation of corpora, whereas we currently use features computed out of context. We assume indeed that corpora may provide additional relevant information (semantic or statistical) for the task addressed in this study. Additional aspects related to the processing of documents (inflected and constructed forms of words, hyphenated or misspelled words, complex terms composed of several words, and numerical values) are another perspective. Besides, the classical readability measures exploited have been developed for the processing of the English language. Working with French-language data, we should use measures that are adapted to this language (Kandel and Moles, 1958; Henry, 1975). In addition, we can also explore various perspectives arising from the current limitations, such as computing and using features at different levels (morphological components, words and complex terms) and adding new reference annotations provided by laymen from other social-professional categories.
RA Côté, 1996