Results of the 4th edition of BioASQ Challenge

The goal of the BioASQ challenge is to push the research frontier towards hybrid information systems. We aim to promote systems and approaches that are able to deal with the whole diversity of the Web, especially for, but not restricted to, the context of biomedicine. This goal is pursued by the organization of challenges. The fourth challenge, like the previous ones, consisted of two tasks: semantic indexing and question answering. 16 systems from 7 different teams participated in the semantic indexing task. The question answering task was tackled by 37 different systems, developed by 11 different teams. 25 of the systems participated in phase A of the task, while 12 participated in phase B, and 3 of the teams participated in both phases. Overall, as in previous years, the best systems were able to outperform the strong baselines. This suggests that advances over the state of the art were achieved through the BioASQ challenge, but also that the benchmark in itself is very challenging. In this paper, we present the data used during the challenge as well as the technologies which were at the core of the participants' frameworks.


Introduction
The aim of this paper is twofold. First, we aim to give an overview of the data issued during the BioASQ challenge in 2016. In addition, we aim to present the systems that participated in the challenge and for which we received system descriptions, as well as to evaluate their performance. To achieve these goals, we begin by giving a brief overview of the tasks, including the timing of the different tasks and the challenge data. Thereafter, we give an overview of the systems which participated in the challenge and provided us with an overview of the technologies they relied upon. Detailed descriptions of some of the systems are given in the lab proceedings. The evaluation of the systems, which was carried out using state-of-the-art measures or manual assessment, is the last focal point of this paper. The conclusion sums up the results of this challenge.

Overview of the Tasks
The challenge comprised two tasks: (1) a large-scale semantic indexing task (Task 4a) and (2) a question answering task (Task 4b).
Large-scale semantic indexing. In Task 4a the goal is to classify documents from the PubMed digital library into concepts of the MeSH hierarchy. Here, new PubMed articles that are not yet annotated are collected on a weekly basis. These articles are used as test sets for the evaluation of the participating systems. As soon as the annotations are available from the PubMed curators, the performance of each system is calculated using standard information retrieval measures as well as hierarchical ones. The winners of each batch were decided based on their performance in the Micro F-measure (MiF) from the family of flat measures (Tsoumakas et al., 2010), and the Lowest Common Ancestor F-measure (LCA-F) from the family of hierarchical measures (Kosmopoulos et al., 2013). For completeness, several other flat and hierarchical measures were also reported (Balikas et al., 2013). In order to provide an on-line and large-scale scenario, the task was divided into three independent batches. In each batch, 5 test sets of biomedical articles were released consecutively. Each of these test sets was released on a weekly basis and the participants had 21 hours to provide their answers. Figure 1 gives an overview of the time plan of Task 4a.
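As a point of reference, the following minimal Python sketch computes the micro F-measure for a set of multi-label predictions. It is a simplified illustration, not the official BioASQ evaluation code, and the MeSH headings in the example are invented.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged F-measure (MiF) for multi-label annotations.

    gold, predicted: lists of label sets (MeSH headings), one per article."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)          # headings predicted and correct
        fp += len(p - g)          # headings predicted but wrong
        fn += len(g - p)          # headings missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Toy example with made-up annotations for two articles:
gold = [{"Humans", "Neoplasms"}, {"Mice", "Liver"}]
pred = [{"Humans", "Neoplasms", "Mice"}, {"Mice"}]
print(round(micro_f_measure(gold, pred), 3))   # 0.75, since P = R = 0.75 over all labels
```

The hierarchical LCA-F measure additionally maps predicted and gold headings onto the MeSH hierarchy before computing an F-measure, which is not reproduced in this sketch.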
Biomedical semantic QA. The goal of Task 4b was to provide a large-scale question answering challenge where the systems should be able to cope with all the stages of a question answering task, including the retrieval of relevant concepts and articles, as well as the provision of natural-language answers. Task 4b comprised two phases: In phase A, BioASQ released questions in English from benchmark datasets created by a group of biomedical experts. There were four types of questions: "yes/no" questions, "factoid" questions, "list" questions and "summary" questions (Balikas et al., 2013). Participants had to respond with relevant concepts (from specific terminologies and ontologies), relevant articles (PubMed articles), relevant snippets extracted from the relevant articles and relevant RDF triples (from specific ontologies). In phase B, the released questions contained the correct answers for the required elements (articles and snippets) of the first phase. The participants had to answer with exact answers as well as with paragraph-sized summaries in natural language (dubbed ideal answers).
The task was split into five independent batches. The two phases of each batch were run with a time gap of 24 hours. For each phase, the participants had 24 hours to submit their answers. We used well-known measures such as mean precision, mean recall, mean F-measure, mean average precision (MAP) and geometric MAP (GMAP) to evaluate the performance of the participants in phase A. The winners were selected based on MAP. The evaluation in phase B for the ideal answers was carried out manually by biomedical experts on the answers provided by the systems. For the sake of completeness, ROUGE (Lin, 2004) is also reported. For the exact answers, we used accuracy for the yes/no questions, mean reciprocal rank (MRR) for the factoids and mean F-measure for the list questions.
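The sketch below illustrates, per question, the quantities behind two of these measures. It is a simplified illustration rather than the official evaluation code: MRR and MAP are the means of these per-question values over the test set, and GMAP is the geometric mean of the per-question average precision values (typically with a small epsilon added to avoid zeros).

```python
def reciprocal_rank(gold_answers, ranked_candidates):
    """1/rank of the first correct candidate answer; 0 if none is correct."""
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate in gold_answers:
            return 1.0 / rank
    return 0.0


def average_precision(relevant, ranked_items):
    """Average precision of one ranked list (e.g. retrieved documents)."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank        # precision at this recall point
    return total / len(relevant) if relevant else 0.0
```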
Overview of Participants

Task 4a
In this subsection we describe the systems for which a description was submitted and highlight their key characteristics.

In (Papagiannopoulou et al., 2016) flat classification approaches were employed for the semantic indexing task. In particular, the last 1 million articles were used as a training set and the last 50 thousand articles were kept as a validation set. Pre-processing of the articles was carried out by concatenating the abstract and the title. Unigrams and bigrams were used as features, removing stop-words and features with fewer than five occurrences in the corpus, and the features were represented with tf-idf weights. The proposed system combines several multi-label classifiers (MLC) in ensembles: in particular, the Meta-Labeler, a set of Binary Relevance (BR) models with linear SVMs, and Prior LDA, a Labeled LDA variant. All of the above models were combined in an ensemble using the MULE framework, a statistical-significance multi-label ensemble that performs classifier selection.

The approach proposed by (Segura-Bedmar et al., 2016) is based on ElasticSearch, which is used to index the training set provided by BioASQ. Each document in the test set is then translated into a query that is fired against the index built from the training set, returning the most relevant documents and their MeSH categories. Finally, each MeSH category is ranked using a scoring scheme based on the frequency of the category and the similarity between the test document and the relevant documents that contain the category.
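As a rough illustration of the retrieval-based approach, the sketch below assigns MeSH headings to a test article by scoring the headings of its most similar training articles. It uses a tf-idf index built with scikit-learn instead of ElasticSearch, and the parameters k and min_score, as well as the exact scoring function, are assumptions rather than the authors' actual choices.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def knn_mesh_labels(train_texts, train_labels, test_text, k=20, min_score=0.5):
    """Score MeSH headings for a test article from its k most similar training
    articles; a heading's score sums the similarities of the retrieved
    articles that carry it (a frequency- and similarity-based score)."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=5)
    train_matrix = vectorizer.fit_transform(train_texts)        # title + abstract per article
    sims = cosine_similarity(vectorizer.transform([test_text]), train_matrix).ravel()
    top = sims.argsort()[::-1][:k]                               # k most similar articles

    scores = defaultdict(float)
    for idx in top:
        for heading in train_labels[idx]:
            scores[heading] += sims[idx]
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [heading for heading, score in ranked if score >= min_score]
```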
Baselines. During the challenge, three systems served as baselines. The first baseline is a state-of-the-art method called Medical Text Indexer (MTI) (Mork et al., 2014), which is developed by the National Library of Medicine and serves as a classification system for articles of MEDLINE. MTI is used by curators in order to assist them in the annotation process. The second baseline is an extension of MTI with the approaches of the winner of the first BioASQ challenge (Tsoumakas et al., 2013). The third one, dubbed BioASQ Filtering (Zavorin et al., 2016), is a new extension of the MTI system in which a Learning to Rank methodology is used as a boosting component. The improved system shows significant gains in both precision and recall for some specific classes of MeSH headings.

Task 4b
As mentioned above, the second task of the challenge is split into two phases. In the first phase, where the goal is to annotate questions with relevant concepts, documents, snippets and RDF triples, 9 teams participated with 25 systems. In the second phase, where teams are requested to submit exact and paragraph-sized answers for the questions, 5 teams participated with 12 different systems.

The system presented in (Papagiannopoulou et al., 2016) is based on the Indri search engine and uses MetaMap and LingPipe to detect the biomedical concepts in local ontology files. For the relevant snippets, the semantic similarity between each sentence and the query (expanded with synonyms) is calculated using a semantic similarity measure. Concerning phase B, they provided exact answers only for the factoid questions. Their system is based on their previous participation in the BioASQ challenge (Papanikolaou et al., 2014). The system tries to extract the lexical answer type by manipulating the words of the question. Then, the relevant snippets of the question, which are provided as input for this task, are processed with the 2013 release of MetaMap in order to extract candidate answers. This year, they extended their approach by expanding both the scoring mechanism and the set of candidate answers.

The system presented in (Yang et al., 2016) extends the system in (Yang et al., 2015). In particular, they used TmTool (CH et al., 2016), in addition to MetaMap, to identify possible biomedical named entities, especially out-of-vocabulary concepts. In addition, they extract frequent multi-word terms from relevant snippets to further improve the recall of concept and candidate answer text extraction. They also introduced a unified classification interface for judging the relevance of each retrieved concept, document and snippet, which can combine the relevance scores evidenced by various sources. A supervised learning method is used to rerank the answer candidates for factoid and list questions, based on the relation between each candidate answer and the other candidate answers.

The system presented in (Schulze et al., 2016) relies on the HANA database for text processing. It uses the Stanford CoreNLP package for tokenizing the questions. Each of the tokens is then sent to BioPortal and to the HANA database for concept retrieval. The concepts retrieved from the two stores are finally merged into a single list that is used to retrieve relevant text passages from the documents at hand. Their second system relies on existing NLP functionality in the in-memory database (IMDB), which they extended with new functions tailored specifically to QA.

The approach presented in (gu Lee et al., 2016) participated in phase A of Task 4b, with the main focus on the retrieval of relevant documents and snippets. The proposed system uses a cluster-based language model and then reranks the retrieved top-n sentences using five independent similarity models based on shallow semantic analysis.
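Several of the systems above score candidate sentences by their similarity to the (possibly synonym-expanded) question. The sketch below illustrates this generic snippet-selection step with a simple Jaccard overlap; the overlap measure, the synonym table and the whitespace tokenization are simplifying assumptions standing in for the semantic similarity measures actually used by the participants.

```python
def expand_query(query_tokens, synonyms):
    """Add known synonyms of each (lower-cased) query token; the synonym
    table is an assumed input, e.g. built from ontology entry terms."""
    expanded = set(query_tokens)
    for token in query_tokens:
        expanded.update(synonyms.get(token, []))
    return expanded


def rank_snippets(query_tokens, sentences, synonyms, top_k=10):
    """Rank candidate sentences by Jaccard overlap with the synonym-expanded
    query (a crude stand-in for a real semantic similarity measure)."""
    query = expand_query(query_tokens, synonyms)
    scored = []
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        similarity = len(query & tokens) / (len(query | tokens) or 1)
        scored.append((similarity, sentence))
    return [sentence for _, sentence in sorted(scored, reverse=True)[:top_k]]
```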

Results

Task 4a
During the evaluation phase of Task 4a, the participants submitted their results on a weekly basis to the online evaluation platform of the challenge. The evaluation period was divided into three batches containing 5 test sets each. 7 teams participated in the task with a total of 16 systems. For measuring the classification performance of the systems, several evaluation measures were used, both flat and hierarchical ones (Balikas et al., 2013). The micro F-measure (MiF) and the Lowest Common Ancestor F-measure (LCA-F) were used to assess the systems and choose the winners for each batch (Kosmopoulos et al., 2013). 12,208,342 articles with 27,301 labels (19.4 GB) were provided as training data to the participants. Table 1 shows the number of articles in each test set of each batch of the challenge.
Table 2 presents the correspondence between the systems for which a description was available and the submitted systems in Task 4a. The systems MTI First Line Index, Default MTI and BioASQ Filtering were the baseline systems used throughout the challenge. Systems that participated in fewer than 4 test sets in a batch are not reported in the results. According to (Demsar, 2006), the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. On each dataset, the system with the best performance gets rank 1.0, the second best gets rank 2.0 and so on. If two or more systems tie, they all receive the average of the tied ranks. Table 3 presents the average rank (according to MiF and LCA-F) of each system over all the test sets of the corresponding batches. Note that the average ranks are calculated over the 4 best results of each system in the batch, according to the rules of the challenge. The best ranked system is highlighted with bold typeface.
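A minimal sketch of this ranking scheme is given below: tied systems share the average of the positions they occupy, and ranks are averaged over test sets. Unlike the official computation, the sketch does not restrict the average to each system's 4 best results per batch.

```python
from collections import defaultdict


def average_ranks(scores_per_test_set):
    """Average rank of each system over several test sets (cf. Demsar, 2006).

    scores_per_test_set: one dict {system: score} per test set; higher scores
    are better. Tied systems share the average of the positions they occupy."""
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in scores_per_test_set:
        ordered = sorted(scores.items(), key=lambda kv: -kv[1])
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and ordered[j][1] == ordered[i][1]:
                j += 1                              # group of tied systems
            shared = (i + 1 + j) / 2.0              # average of positions i+1 .. j
            for system, _ in ordered[i:j]:
                totals[system] += shared
                counts[system] += 1
            i = j
    return {system: totals[system] / counts[system] for system in totals}
```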

Task 4b
Phase A. Table 4 presents the statistics of the training and test data provided to the participants. The evaluation included five test batches. For phase A of Task 4b, the systems were allowed to submit responses for any of the corresponding types of annotations, that is documents, concepts, snippets and RDF triples. For each of the categories we rank the systems according to the Mean Average Precision (MAP) measure (Balikas et al., 2013). The final ranking for each batch is calculated as the average of the individual rankings in the different categories. Tables 5 and 6 present some indicative results from batch 1.
The detailed results for Task 4b, phase A, can be found at http://participants-area.bioasq.org/results/4b/phaseA/.
Phase B. In phase B of Task 4b the systems were asked to report exact and ideal answers. The systems were ranked according to the manual evaluation of the ideal answers by the BioASQ experts (Balikas et al., 2013), and according to automatic measures for the exact answers.
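As noted in the task overview, ROUGE (Lin, 2004) scores are also reported for the ideal answers. For intuition, the sketch below shows the basic idea behind a ROUGE-N recall score against a single reference ideal answer; the official evaluation relies on the ROUGE toolkit and its variants, so this is only an approximation.

```python
from collections import Counter


def rouge_n_recall(reference, candidate, n=2):
    """ROUGE-N recall: fraction of the reference's n-grams that also appear
    in the candidate summary (counts clipped by the candidate's counts)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```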
Table 7 shows indicative results for the exact answers in phase B of Task 4b (batch 3). Whenever a system did not provide exact answers for a particular type of question, a hyphenation symbol (-) is used. The results of the other batches are available at http://participants-area.bioasq.org/results/4b/phaseB/.

From these results we can see that the systems achieve very high performance (above 90% accuracy) on the yes/no questions. The performance on factoid and list questions is not as good, indicating that there is room for improvement.

Conclusion
In this paper, an overview of the fourth BioASQ challenge is presented. As in the previous challenges, it consisted of two tasks: semantic indexing and question answering. Overall, as in previous years, the best systems were able to outperform the strong baselines provided by the organizers. This suggests that advances over the state of the art were achieved through the BioASQ challenge, but also that the benchmark in itself is very challenging. Consequently, we regard the outcome of the challenge as a success towards pushing research on biomedical information systems a step further. In future editions of the challenge, we aim to provide even more benchmark data derived from a community-driven acquisition process.

Table 4: Statistics on the training and test datasets of Task 4b. All the numbers for the documents, snippets, concepts and triples refer to averages.

Table 1: Statistics on the test datasets of Task 4a.

Table 2: Correspondence of reference and submitted systems for Task 4a.

Table 3: Average ranks for each system across the batches of Task 4a for the measures MiF and LCA-F. A hyphenation symbol (-) is used whenever a system participated in fewer than 4 test sets of the batch.

Table 5: Results for batch 1 for documents in phase A of Task 4b.

Table 6: Results for batch 1 for snippets in phase A of Task 4b.

Table 7: Results for batch 3 for exact answers in phase B of Task 4b.