<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W16">
  <paper id="5100">
    <title>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</title>
    <editor>Sophia Ananiadou</editor>
    <editor>Riza Batista-Navarro</editor>
    <editor>Kevin Bretonnel Cohen</editor>
    <editor>Dina Demner-Fushman</editor>
    <editor>Paul Thompson</editor>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <url>http://aclweb.org/anthology/W16-51</url>
    <bibtype>book</bibtype>
    <bibkey>BioTxtM2016:2016</bibkey>
  </paper>

  <paper id="5101">
    <title>Cancer Hallmark Text Classification Using Convolutional Neural Networks</title>
    <author><first>Simon</first><last>Baker</last></author>
    <author><first>Anna</first><last>Korhonen</last></author>
    <author><first>Sampo</first><last>Pyysalo</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>1&#8211;9</pages>
    <url>http://aclweb.org/anthology/W16-5101</url>
    <abstract>Methods based on deep learning approaches have recently achieved
	state-of-the-art performance in a range of machine learning tasks and are
	increasingly applied to natural language processing (NLP). Despite strong
	results in various established NLP tasks involving general domain texts, there
	is only limited work applying these models to biomedical NLP. In this paper, we
	consider a Convolutional Neural Network (CNN) approach to biomedical text
	classification. Evaluation using a recently introduced cancer domain dataset
	involving the categorization of documents according to the well-established
	hallmarks of cancer shows that a basic CNN model can achieve a level of
	performance competitive with a Support Vector Machine (SVM) trained using
	complex manually engineered features optimized to the task. We further show
	that simple modifications to the CNN hyperparameters, initialization, and
	training process allow the model to notably outperform the SVM, establishing a
	new state-of-the-art result on this task. We make all of the resources and
	tools introduced in this study available under open licenses from
	https://cambridgeltl.github.io/cancer-hallmark-cnn/.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>baker-korhonen-pyysalo:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5102">
    <title>Learning Orthographic Features in Bi-directional LSTM for Biomedical Named Entity Recognition</title>
    <author><first>Nut</first><last>Limsopatham</last></author>
    <author><first>Nigel</first><last>Collier</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>10&#8211;19</pages>
    <url>http://aclweb.org/anthology/W16-5102</url>
    <abstract>End-to-end neural network models for named entity recognition (NER) have been
	shown to achieve effective performance on general domain datasets (e.g.,
	newswire) without requiring additional hand-crafted features. However, in the
	biomedical domain, recent studies have shown that hand-engineered features
	(e.g., orthographic features) should be used to attain effective performance,
	due to the complexity of biomedical terminology (e.g., the use of acronyms and
	complex gene names). In this work, we propose a novel approach that allows a neural
	network model based on a long short-term memory (LSTM) to automatically learn
	orthographic features and incorporate them into a model for biomedical NER.
	Importantly, our bi-directional LSTM model learns and leverages orthographic
	features on an end-to-end basis. We evaluate our approach by comparing against
	existing neural network models for NER using three well-established biomedical
	datasets. Our experimental results show that the proposed approach consistently
	outperforms these strong baselines across all of the three datasets.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>limsopatham-collier:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5103">
    <title>Building Content-driven Entity Networks for Scarce Scientific Literature using Content Information</title>
    <author><first>Reinald Kim</first><last>Amplayo</last></author>
    <author><first>Min</first><last>Song</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>20&#8211;29</pages>
    <url>http://aclweb.org/anthology/W16-5103</url>
    <abstract>This paper proposes several network construction methods for collections of
	scarce scientific literature data. We define scarcity as lacking in value and
	in volume. Instead of using the paper's metadata to construct several kinds of
	scientific networks, we use the full texts of the articles and automatically
	extract the entities needed to construct the networks. Specifically, we present
	seven kinds of networks using the proposed construction methods: co-occurrence
	networks for author, keyword, and biological entities, and citation networks
	for author, keyword, biological, and topic entities. We show two case studies
	that apply our proposed methods: CADASIL, a rare yet the most common form of
	hereditary stroke disorder, and Metformin, the first-line medication for the
	treatment of type 2 diabetes. We apply our proposed method to four different
	applications for evaluation: finding prolific authors, finding important
	bio-entities, finding meaningful keywords, and discovering influential topics.
	The results show that the co-occurrence and citation networks constructed using
	the proposed method outperform networks constructed using traditional methods.
	We also compare our proposed networks to traditional citation networks
	constructed using sufficient data and infer that, even with the same amount of
	data, our methods perform comparably to or better than the traditional methods.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>amplayo-song:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5104">
    <title>Named Entity Recognition in Swedish Health Records with Character-Based Deep Bidirectional LSTMs</title>
    <author><first>Simon</first><last>Almgren</last></author>
    <author><first>Sean</first><last>Pavlov</last></author>
    <author><first>Olof</first><last>Mogren</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>30&#8211;39</pages>
    <url>http://aclweb.org/anthology/W16-5104</url>
    <abstract>We propose an approach for named entity recognition in medical data, using a
	character-based deep bidirectional recurrent neural network. Such models can
	learn features and patterns based on the character sequence, and are not
	limited to a fixed vocabulary. This makes them very well suited for the NER
	task in the medical domain. Our experimental evaluation shows promising
	results, with a 60% improvement in F1 score over the baseline, and our system
	generalizes well between different datasets.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>almgren-pavlov-mogren:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5105">
    <title>Entity-Supported Summarization of Biomedical Abstracts</title>
    <author><first>Frederik</first><last>Schulze</last></author>
    <author><first>Mariana</first><last>Neves</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>40&#8211;49</pages>
    <url>http://aclweb.org/anthology/W16-5105</url>
    <abstract>The increasing amount of biomedical information that is available for
	researchers and clinicians makes it harder to quickly find the right
	information. Automatic summarization of multiple texts can provide summaries
	specific to the user’s information needs. In this paper we look into the use of
	named-entity recognition for graph-based summarization. We extend the LexRank
	algorithm with information about named entities and present EntityRank, a
	multi-document graph-based summarization algorithm that is solely based on
	named entities. We evaluate our system on a dataset of 1,009 human-written
	summaries provided by BioASQ and on 1,974 gene summaries fetched from the
	Entrez Gene database. The results show that the addition of named-entity
	information increases the performance of graph-based summarizers and that
	EntityRank significantly outperforms the other methods with regard to the ROUGE
	measures.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>schulze-neves:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5106">
    <title>Fully unsupervised low-dimensional representation of adverse drug reaction events through distributional semantics</title>
    <author><first>Alicia</first><last>P&#233;rez</last></author>
    <author><first>Arantza</first><last>Casillas</last></author>
    <author><first>Koldo</first><last>Gojenola</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>50&#8211;59</pages>
    <url>http://aclweb.org/anthology/W16-5106</url>
    <abstract>Electronic health records show great variability, since the same concept is
	often expressed with different terms: scientific Latin forms, common or lay
	variants, and even vernacular naming. Deep learning enables a distributional
	representation of terms in a vector space, so that related terms tend to be
	close in that space. Accordingly, embedding words through these vectors opens
	the way towards accounting for semantic relatedness through classical
	algebraic operations.
	In this work we propose a simple yet efficient unsupervised characterization
	of Adverse Drug Reactions (ADRs). This approach exploits the embedding
	representation of the terms involved in candidate ADR events, that is,
	drug-disease entity pairs. In brief, the ADRs are represented as vectors that
	link the drug with the disease in their context through a recursive additive
	model.
	We discovered that a low-dimensional representation that makes use of the
	modulus and argument of the embedded representation of the ADR event shows
	correlation with the manually annotated class. Thus, this characterization can
	serve as a set of beneficial predictive features for further classification
	tasks.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>perez-casillas-gojenola:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5107">
    <title>A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage</title>
    <author><first>Thomas</first><last>Lavergne</last></author>
    <author><first>Aur&#233;lie</first><last>N&#233;v&#233;ol</last></author>
    <author><first>Aude</first><last>Robert</last></author>
    <author><first>Cyril</first><last>Grouin</last></author>
    <author><first>Gr&#233;goire</first><last>Rey</last></author>
    <author><first>Pierre</first><last>Zweigenbaum</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>60&#8211;69</pages>
    <url>http://aclweb.org/anthology/W16-5107</url>
    <abstract>Very few datasets have been released for the evaluation of diagnosis coding
	with the International Classification of Diseases, and only one so far in a
	language other than English. This paper describes a large-scale dataset
	prepared from French death certificates, and the problems which needed to be
	solved to turn it into a dataset suitable for the application of machine
	learning and natural language processing methods to ICD-10 coding. The dataset
	includes the free-text statements written by medical doctors, the associated
	meta-data, the human coder-assigned codes for each statement, as well as the
	statement segments which supported the coder’s decision for each code. The
	dataset comprises 93,694 death certificates totalling 276,103 statements and
	377,677 ICD-10 code assignments (3,457 unique codes). It was made available for
	an international automated coding shared task, which attracted five
	participating teams. An extended version of the dataset will be used in a new
	edition of the shared task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lavergne-EtAl:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5108">
    <title>A Corpus of Tables in Full-Text Biomedical Research Publications</title>
    <author><first>Tatyana</first><last>Shmanina</last></author>
    <author><first>Ingrid</first><last>Zukerman</last></author>
    <author><first>Ai Lee</first><last>Cheam</last></author>
    <author><first>Thomas</first><last>Bochynek</last></author>
    <author><first>Lawrence</first><last>Cavedon</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>70&#8211;79</pages>
    <url>http://aclweb.org/anthology/W16-5108</url>
    <abstract>The development of text mining techniques for biomedical research literature
	has received increased attention in recent times. However, most of these
	techniques focus on prose, while much important biomedical data reside in
	tables. In this paper, we present a corpus created to serve as a gold standard
	for the development and evaluation of techniques for the automatic extraction
	of information from biomedical tables. We describe the guidelines used for
	corpus annotation and the manner in which they were developed. The high
	inter-annotator agreement achieved on the corpus, and the generic nature of our
	annotation approach, suggest that the developed guidelines can serve as a
	general framework for table annotation in biomedical and other scientific
	domains. The annotated corpus and the guidelines are available at
	http://www.csse.monash.edu.au/research/umnl/data/index.shtml.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>shmanina-EtAl:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5109">
    <title>Supervised classification of end-of-lines in clinical text with no manual annotation</title>
    <author><first>Pierre</first><last>Zweigenbaum</last></author>
    <author><first>Cyril</first><last>Grouin</last></author>
    <author><first>Thomas</first><last>Lavergne</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>80&#8211;88</pages>
    <url>http://aclweb.org/anthology/W16-5109</url>
    <abstract>In some plain text documents, end-of-line marks may or may not mark the
	boundary of a text unit (e.g., of a paragraph).  This vexing problem is likely
	to impact subsequent natural language processing components, but is seldom
	addressed in the literature.  We propose a method which uses no manual
	annotation to classify whether end-of-lines must actually be seen as simple
	spaces (soft line breaks) or as true text unit boundaries. This method, which
	includes self-training and co-training steps based on token and line length
	features, achieves 0.943 F-measure on a corpus of short e-books with controlled
	format, F=0.904 on a random sample of 24 clinical texts with soft line breaks,
	and F=0.898 on a larger set of mixed clinical texts which may or may not
	contain soft line breaks, a fairly high value for a method with no manual
	annotation.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zweigenbaum-grouin-lavergne:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5110">
    <title>BioDCA Identifier: A System for Automatic Identification of Discourse Connective and Arguments from Biomedical Text</title>
    <author><first>Sindhuja</first><last>Gopalan</last></author>
    <author><first>Sobha</first><last>Lalitha Devi</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>89&#8211;98</pages>
    <url>http://aclweb.org/anthology/W16-5110</url>
    <abstract>This paper describes a natural language processing system developed for the
	automatic identification of explicit connectives, their senses, and their
	arguments. Prior work has shown that differences in the usage of connectives
	across corpora affect the cross-domain connective identification task
	negatively. Hence, the development of a domain-specific discourse parser has
	become indispensable. Here, we present a corpus annotated with discourse
	relations on Medline abstracts. A kappa score is calculated to check the
	annotation quality of our corpus. Previous work on discourse analysis in
	biomedical data has concentrated only on the identification of connectives,
	and hence we have developed an end-to-end parser for connective and argument
	identification using the Conditional Random Fields algorithm. The type and
	sub-type of the connective sense are also identified. The results obtained are
	encouraging.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gopalan-lalithadevi:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5111">
    <title>Data, tools and resources for mining social media drug chatter</title>
    <author><first>Abeed</first><last>Sarker</last></author>
    <author><first>Graciela</first><last>Gonzalez</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>99&#8211;107</pages>
    <url>http://aclweb.org/anthology/W16-5111</url>
    <abstract>Social media has emerged as a crucial resource for obtaining population-based
	signals for various public health monitoring and surveillance tasks, such as
	pharmacovigilance. There is an abundance of knowledge hidden within social
	media data, and the volume is growing. Drug-related chatter on social media can
	include user-generated information that can provide insights into public health
	problems such as abuse, adverse reactions, long-term effects, and multi-drug
	interactions. Our objective in this paper is to present to the biomedical
	natural language processing, data science, and public health communities data
	sets (annotated and unannotated), tools and resources that we have collected
	and created from social media. The data we present was collected from Twitter
	using the generic and brand names of drugs as keywords, along with their common
	misspellings. Following the collection of the data, annotation guidelines were
	created over several iterations, which detail important aspects of social media
	data annotation and can be used by future researchers for developing similar
	data sets. The annotation guidelines were followed to prepare data sets for
	text classification, information extraction and normalization. In this paper,
	we discuss the preparation of these guidelines, outline the data sets prepared,
	and present an overview of our state-of-the-art systems for data collection,
	supervised classification, and information extraction. In addition to the
	development of supervised systems for classification and extraction, we
	developed and released unlabeled data and language models. We discuss the
	potential uses of these language models in data mining and the large volumes of
	unlabeled data from which they were generated. We believe that the summaries
	and repositories we present here of our data, annotation guidelines, models,
	and tools will be beneficial to the research community as a single-point entry
	for all these resources, and will promote further research in this area.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>sarker-gonzalez:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5112">
    <title>Detection of Text Reuse in French Medical Corpora</title>
    <author><first>Eva</first><last>D'hondt</last></author>
    <author><first>Cyril</first><last>Grouin</last></author>
    <author><first>Aur&#233;lie</first><last>N&#233;v&#233;ol</last></author>
    <author><first>Efstathios</first><last>Stamatatos</last></author>
    <author><first>Pierre</first><last>Zweigenbaum</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>108&#8211;114</pages>
    <url>http://aclweb.org/anthology/W16-5112</url>
    <abstract>Electronic Health Records (EHRs) are increasingly available in modern health
	care institutions either through the direct creation of electronic documents in
	hospitals' health information systems, or through the digitization of
	historical paper records. Each EHR creation method yields the need for
	sophisticated text reuse detection tools in order to prepare the EHR
	collections for efficient secondary use relying on Natural Language Processing
	methods. Herein, we address the detection of two types of text reuse in French
	EHRs: 1) the detection of updated versions of the same document and 2) the
	detection of document duplicates that still bear surface differences due to OCR
	or de-identification processing. We present a robust text reuse detection
	method to automatically identify redundant document pairs in two French EHR
	corpora that achieves an overall macro F-measure of 0.68 and 0.60, respectively
	and correctly identifies all redundant document pairs of interest.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>dhondt-EtAl:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5113">
    <title>Negation Detection in Clinical Reports Written in German</title>
    <author><first>Viviana</first><last>Cotik</last></author>
    <author><first>Roland</first><last>Roller</last></author>
    <author><first>Feiyu</first><last>Xu</last></author>
    <author><first>Hans</first><last>Uszkoreit</last></author>
    <author><first>Klemens</first><last>Budde</last></author>
    <author><first>Danilo</first><last>Schmidt</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>115&#8211;124</pages>
    <url>http://aclweb.org/anthology/W16-5113</url>
    <abstract>An important subtask in clinical text mining is to identify whether a
	clinical finding is expressed as present, absent or unsure in a text. This work
	presents a system for detecting mentions of clinical findings that are negated
	or just speculated. The system has been applied to two different types of
	German clinical texts: clinical notes and discharge summaries. Our approach is
	built on top of NegEx, a well known algorithm for identifying non-factive
	mentions of medical findings. In this work, we adjust a previous adaptation of
	NegEx to German and evaluate the system on our data to detect negation and
	speculation. The results are compared to a baseline algorithm and are analyzed
	for both types of clinical documents. Our system achieves an F1-Score above 0.9
	on both types of reports.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>cotik-EtAl:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5114">
    <title>Scoring Disease-Medication Associations using Advanced NLP, Machine Learning, and Multiple Content Sources</title>
    <author><first>Bharath</first><last>Dandala</last></author>
    <author><first>Murthy</first><last>Devarakonda</last></author>
    <author><first>Mihaela</first><last>Bornea</last></author>
    <author><first>Christopher</first><last>Nielson</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>125&#8211;133</pages>
    <url>http://aclweb.org/anthology/W16-5114</url>
    <abstract>Effective knowledge resources are critical for developing successful clinical
	decision support systems that alleviate the cognitive load on physicians in
	patient care. In this paper, we describe two new methods for building a
	knowledge resource of disease to medication associations. These methods use
	fundamentally different content and are based on advanced natural language
	processing and machine learning techniques. One method uses distributional
	semantics on large medical text, and the other uses data mining on a large
	number of patient records. The methods are evaluated using 25,379 unique
	disease-medication pairs extracted from 100 de-identified longitudinal patient
	records of a large multi-provider hospital system. We measured recall (R),
	precision (P), and F scores for positive and negative association prediction,
	along with coverage and accuracy. While individual methods performed well, a
	combined stacked classifier achieved the best performance, indicating the
	limitations and unique value of each resource and method. In predicting
	positive associations, the stacked combination significantly outperformed the
	baseline (a distant semi-supervised method on large medical text), achieving F
	scores of 0.75 versus 0.55 on the pairs seen in the patient records, and F
	scores of 0.69 and 0.35 on unique pairs.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>dandala-EtAl:2016:BioTxtM2016</bibkey>
  </paper>

  <paper id="5115">
    <title>Author Name Disambiguation in MEDLINE Based on Journal Descriptors and Semantic Types</title>
    <author><first>Dina</first><last>Vishnyakova</last></author>
    <author><first>Raul</first><last>Rodriguez-Esteban</last></author>
    <author><first>Khan</first><last>Ozol</last></author>
    <author><first>Fabio</first><last>Rinaldi</last></author>
    <booktitle>Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>134&#8211;142</pages>
    <url>http://aclweb.org/anthology/W16-5115</url>
    <abstract>Author name disambiguation (AND) in publication and citation resources is a
	well-known problem. Often, information about email address and other details in
	the affiliation is missing. In cases where such information is not available,
	identifying the authorship of publications becomes very challenging.
	Consequently, there have been attempts to resolve such cases by utilizing
	external resources as references. However, such external resources are
	heterogeneous and are not always reliable regarding the correctness of
	information.  To solve the AND task, especially when information about an
	author is not complete, we suggest the use of new features such as journal
	descriptors (JD) and semantic types (ST). The evaluation of different feature
	models shows that their inclusion has an impact equivalent to that of other
	important features such as email address. Using such features we show that our
	system outperforms the state of the art.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vishnyakova-EtAl:2016:BioTxtM2016</bibkey>
  </paper>

</volume>