<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="8000" href="https://doi.org/10.26615/978-954-452-044-1_">
    <title>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</title>
    <editor>Svetla Boytcheva</editor>
    <editor>Kevin Bretonnel Cohen</editor>
    <editor>Guergana Savova</editor>
    <editor>Galia Angelova</editor>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <doi>10.26615/978-954-452-044-1_</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_</url>
    <bibtype>book</bibtype>
    <bibkey>BioNLP:2017</bibkey>
  </paper>

  <paper id="8001" href="https://doi.org/10.26615/978-954-452-044-1_001">
    <title>Document retrieval and question answering in medical documents. A large-scale corpus challenge.</title>
    <author><first>Curea</first><last>Eric</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>1&#8211;7</pages>
    <doi>10.26615/978-954-452-044-1_001</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_001</url>
    <abstract>Whenever employed on large datasets, information retrieval works by isolating a
	subset of documents from the larger dataset and then proceeding with low-level
	processing of the text. This is usually carried out by means of adding
	index-terms to each document in the collection. In this paper we deal with
	automatic document classification and index-term detection applied on
	large-scale medical corpora. In our methodology we employ a linear classifier
	and we test our results on the BioASQ training corpus, which is a collection
	of 12 million MeSH-indexed medical abstracts. We cover term indexing,
	result retrieval and result ranking based on distributed word representations.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>eric:2017:BioNLP</bibkey>
  </paper>

  <paper id="8002" href="https://doi.org/10.26615/978-954-452-044-1_002">
    <title>Adapting the TTL Romanian POS Tagger to the Biomedical Domain</title>
    <author><first>Maria</first><last>Mitrofan</last></author>
    <author><first>Radu</first><last>Ion</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>8&#8211;14</pages>
    <doi>10.26615/978-954-452-044-1_002</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_002</url>
    <abstract>This paper presents the adaptation of the Hidden Markov Models-based TTL
	part-of-speech tagger to the biomedical domain. TTL is a text processing
	platform that performs sentence splitting, tokenization, POS tagging, chunking 
	and Named Entity Recognition (NER) for a number of languages, including
	Romanian. The POS tagging accuracy obtained by the TTL POS tagger exceeds 97%
	when TTL's baseline model is updated with training information from a Romanian
	biomedical corpus. This corpus is developed in the context of the CoRoLa (a
	reference corpus for the contemporary Romanian language) project. Informative
	description and statistics of the Romanian biomedical corpus are also provided.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mitrofan-ion:2017:BioNLP</bibkey>
  </paper>

  <paper id="8003" href="https://doi.org/10.26615/978-954-452-044-1_003">
    <title>Discourse-Wide Extraction of Assay Frames from the Biological Literature</title>
    <author><first>Dayne</first><last>Freitag</last></author>
    <author><first>Paul</first><last>Kalmar</last></author>
    <author><first>Eric</first><last>Yeh</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>15&#8211;23</pages>
    <doi>10.26615/978-954-452-044-1_003</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_003</url>
    <abstract>We consider the problem of populating multi-part knowledge frames
	  from textual information distributed over multiple sentences in a
	  document.  We present a corpus constructed by aligning papers from
	  the cellular signaling literature to a collection of approximately
	  50,000 reference frames curated by hand as part of a decade-long
	  project. We present and evaluate two approaches to the challenging
	  problem of reconstructing these frames, which formalize biological
	  assays described in the literature.  One approach is based on
	  classifying candidate records nominated by sentence-local entity
	  co-occurrence. In the second approach, we introduce a novel virtual register
	  machine that traverses an article and generates frames, trained on our
	  reference data. Our evaluations show that success in the
	  task ultimately hinges on an integration of evidence spread across
	  the discourse.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>freitag-kalmar-yeh:2017:BioNLP</bibkey>
  </paper>

  <paper id="8004" href="https://doi.org/10.26615/978-954-452-044-1_004">
    <title>Classification based extraction of numeric values from clinical narratives</title>
    <author><first>Maximilian</first><last>Zubke</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>24&#8211;31</pages>
    <doi>10.26615/978-954-452-044-1_004</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_004</url>
    <abstract>The robust extraction of numeric values from clinical narratives is a
	well-known problem in clinical data warehouses.
	In this paper we describe a dynamic and domain-independent approach to extracting
	numerically described values from clinical narratives. In contrast to alternative
	systems, we use neither manually defined rules nor any kind of ontology or
	nomenclature. Instead we propose a topic-based system that treats
	information extraction as a text classification problem. We use machine
	learning to identify the crucial context features of a topic-specific numeric
	value from a given set of example sentences, so that the manual effort reduces to
	the selection of appropriate sample sentences.
	We describe the context features of a given numeric value by term frequency
	vectors, which are generated by multiple document segmentation procedures. Because
	these segmentation procedures run simultaneously, there can be more than one
	context vector for a numeric value. In those cases, we choose the context
	vector with the highest classification confidence and suppress the rest.
	To test our approach, we used a dataset from a German hospital containing
	12,743 narrative reports about laboratory results of leukemia patients. We
	used Support Vector Machines (SVM) for classification and achieved an average
	accuracy of 96% on a manually labeled subset of 2073 documents, using 10-fold
	cross-validation. This is a significant improvement over an alternative
	rule-based system.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zubke:2017:BioNLP</bibkey>
  </paper>

  <paper id="8005" href="https://doi.org/10.26615/978-954-452-044-1_005">
    <title>Understanding of unknown medical words</title>
    <author><first>Natalia</first><last>Grabar</last></author>
    <author><first>Thierry</first><last>Hamon</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>32&#8211;41</pages>
    <doi>10.26615/978-954-452-044-1_005</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_005</url>
    <abstract>We assume that unknown words with internal structure (affixed words or
	compounds) can provide speakers with linguistic cues as to their meaning, and
	thus help their decoding and understanding. To verify this hypothesis, we
	propose to work with a set of French medical words. These words are annotated
	by five annotators. Then, two kinds of analysis are performed: analysis of the
	evolution of understandable and non-understandable words (globally and
	according to some suffixes) and analysis of clusters created with unsupervised
	algorithms on the basis of linguistic and extra-linguistic features of the studied
	words. Our results suggest that, depending on the linguistic sensitivity of the
	annotators, technical words can be decoded and become understandable. As for
	the clusters, some of them distinguish between understandable and
	non-understandable words. Resources built in this work will be made freely
	available for research purposes.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>grabar-hamon:2017:BioNLP</bibkey>
  </paper>

  <paper id="8006" href="https://doi.org/10.26615/978-954-452-044-1_006">
    <title>Entity-Centric Information Access with Human in the Loop for the Biomedical Domain</title>
    <author><first>Seid Muhie</first><last>Yimam</last></author>
    <author><first>Steffen</first><last>Remus</last></author>
    <author><first>Alexander</first><last>Panchenko</last></author>
    <author><first>Andreas</first><last>Holzinger</last></author>
    <author><first>Chris</first><last>Biemann</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>42&#8211;48</pages>
    <doi>10.26615/978-954-452-044-1_006</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_006</url>
    <abstract>In this paper, we describe the concept of entity-centric information access for
	the biomedical domain. With entity recognition technologies approaching
	acceptable levels of accuracy, we put forward a paradigm of document browsing
	and searching where the entities of the domain and their relations are
	explicitly modeled to provide users the possibility of collecting exhaustive
	information on relations of interest. We describe three working prototypes
	along these lines: NEW/S/LEAK, which was developed for investigative
	journalists who need a quick overview of large leaked document collections;
	STORYFINDER, which is a personalized organizer for information found in web
	pages that allows adding entities as well as relations, and is capable of
	personalized information management; and adaptive annotation capabilities of
	WEBANNO, which is a general-purpose linguistic annotation tool. We will discuss
	future steps towards the adaptation of these tools to biomedical data, which is
	subject to a recently started project on biomedical knowledge acquisition. A
	key difference to other approaches is the centering around the user in a
	Human-in-the-Loop machine learning approach, where users define and extend
	categories and enable the system to improve via feedback and interaction.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yimam-EtAl:2017:BioNLP</bibkey>
  </paper>

  <paper id="8007" href="https://doi.org/10.26615/978-954-452-044-1_007">
    <title>One model per entity: using hundreds of machine learning models to recognize and normalize biomedical names in text</title>
    <author><first>Victor</first><last>Bellon</last></author>
    <author><first>Raul</first><last>Rodriguez-Esteban</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>49&#8211;54</pages>
    <doi>10.26615/978-954-452-044-1_007</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_007</url>
    <abstract>We explored a new approach to named entity recognition based on hundreds of
	machine learning models, each trained to distinguish a single entity, and
	showed its application to gene name identification (GNI). The rationale for our
	approach, which we named &#x201c;one model per entity&#x201d; (OMPE), was that increasing
	the number of models would make the learning task easier for each individual
	model. Our training strategy leveraged freely-available database annotations
	instead of manually-annotated corpora. While its performance in our
	proof-of-concept was disappointing, we believe that there is enough room for
	improvement that such approaches could reach competitive performance while
	eliminating the cost of creating training corpora.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bellon-rodriguezesteban:2017:BioNLP</bibkey>
  </paper>

  <paper id="8008" href="https://doi.org/10.26615/978-954-452-044-1_008">
    <title>Towards Confidence Estimation for Typed Protein-Protein Relation Extraction</title>
    <author><first>Camilo</first><last>Thorne</last></author>
    <author><first>Roman</first><last>Klinger</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>55&#8211;63</pages>
    <doi>10.26615/978-954-452-044-1_008</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_008</url>
    <abstract>Systems which build on top of information extraction are typically
	  challenged to extract knowledge that, while correct, is not yet well-known. 
	  We hypothesize that a good
	  confidence measure for relational information has the property that
	  such interesting information is found between information
	  extracted with very high confidence and very low confidence.
	  We discuss confidence estimation for the domain of biomedical
	  protein-protein relation discovery in biomedical literature. As
	  facts reported in papers take some time to be validated and recorded
	  in biomedical databases, such a task gives rise to large quantities of
	  unknown but potentially true candidate relations.  It is thus
	  important to rank them based on supporting evidence rather than
	  discard them.
	  In this paper, we discuss this task and propose different approaches
	  for confidence estimation and a pipeline to evaluate such
	  methods. We show that the most straightforward approach, a
	  combination of different confidence measures from pipeline modules,
	  seems not to work well. We discuss this negative result and pinpoint
	  potential future research directions.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>thorne-klinger:2017:BioNLP</bibkey>
  </paper>

  <paper id="8009" href="https://doi.org/10.26615/978-954-452-044-1_009">
    <title>Identification of Risk Factors in Clinical Texts through Association Rules</title>
    <author><first>Svetla</first><last>Boytcheva</last></author>
    <author><first>Ivelina</first><last>Nikolova</last></author>
    <author><first>Galia</first><last>Angelova</last></author>
    <author><first>Zhivko</first><last>Angelov</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>64&#8211;72</pages>
    <doi>10.26615/978-954-452-044-1_009</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_009</url>
    <abstract>We describe a method which extracts Association Rules from texts in order to
	recognise verbalisations of risk factors. Usually some basic vocabulary about
	risk factors is known but medical conditions are expressed in clinical
	narratives with much higher variety. We propose an approach for data-driven
	learning of specialised medical vocabulary which, once collected, enables
	early alerting of potentially affected patients. The method is illustrated by
	experiments with clinical records of patients with Chronic Obstructive Pulmonary
	Disease (COPD) and comorbidity of CORD, Diabetes Mellitus and Schizophrenia. Our
	input data come from the Bulgarian Diabetic Register, which is built using a
	pseudonymised collection of outpatient records for about 500,000 diabetic
	patients. The generated Association Rules for CORD are analysed in the context
	of demographic, gender, and age information. Valuable amounts of meaningful
	words, signalling risk factors, are discovered with high precision and
	confidence.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>boytcheva-EtAl:2017:BioNLP1</bibkey>
  </paper>

  <paper id="8010" href="https://doi.org/10.26615/978-954-452-044-1_010">
    <title>POMELO: Medline corpus with manually annotated food-drug interactions</title>
    <author><first>Thierry</first><last>Hamon</last></author>
    <author><first>Vincent</first><last>Tabanou</last></author>
    <author><first>Fleur</first><last>Mougin</last></author>
    <author><first>Natalia</first><last>Grabar</last></author>
    <author><first>Frantz</first><last>Thiessard</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>73&#8211;80</pages>
    <doi>10.26615/978-954-452-044-1_010</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_010</url>
    <abstract>When patients take more than one medication, they may be at risk of drug
	interactions, which means that a given drug can cause unexpected effects when
	taken in combination with other drugs. Similar effects may occur when drugs are
	taken together with some food or beverages. For instance, grapefruit has
	interactions with several drugs, because its active ingredients inhibit enzymes
	involved in drug metabolism and can then cause an excessive dosage of these
	drugs. Yet, information on food/drug interactions is poorly researched. The
	current research is mainly provided by the medical domain, and only very
	tentative work is provided by the computer science and NLP domains. One
	factor that motivates the research is related to the availability of
	annotated corpora and reference data. The purpose of our work is to
	describe the rationale and approach for the creation and annotation of a
	scientific corpus with information on food/drug interactions. This corpus
	contains 639 MEDLINE citations (titles and abstracts), corresponding to 5,752
	sentences. It is manually annotated by two experts. The corpus is named POMELO.
	This annotated corpus will be made available for research purposes.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hamon-EtAl:2017:BioNLP</bibkey>
  </paper>

  <paper id="8011" href="https://doi.org/10.26615/978-954-452-044-1_011">
    <title>Annotation of Clinical Narratives in Bulgarian language</title>
    <author><first>Ivajlo</first><last>Radev</last></author>
    <author><first>Kiril</first><last>Simov</last></author>
    <author><first>Galia</first><last>Angelova</last></author>
    <author><first>Svetla</first><last>Boytcheva</last></author>
    <booktitle>Proceedings of the Biomedical NLP Workshop associated with RANLP 2017</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Varna, Bulgaria</address>
    <publisher>INCOMA Ltd.</publisher>
    <pages>81&#8211;87</pages>
    <doi>10.26615/978-954-452-044-1_011</doi>
    <url>https://doi.org/10.26615/978-954-452-044-1_011</url>
    <abstract>In this paper we describe the process of annotating clinical texts with
	morphosyntactic and semantic information. The corpus contains 1,300 discharge
	letters in Bulgarian for patients with Endocrinology and Metabolic
	disorders. The annotated corpus will be used as a gold standard for evaluating
	information extraction on a test corpus of 6,200 discharge letters. The annotation
	is performed within the Clark system, an XML Based System For Corpora
	Development. It provides a mechanism for semi-automatic annotation: first
	a pipeline for Bulgarian morphosyntactic annotation and a cascaded regular
	grammar for semantic annotation are run, then rules for cleaning frequent
	errors are applied. Finally, the result is manually checked. Ultimately we
	also hope to adapt the morphosyntactic tagger to the domain of
	clinical narratives.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>boytcheva-EtAl:2017:BioNLP2</bibkey>
  </paper>

</volume>

