2024
NeuroTrialNER: An Annotated Corpus for Neurological Diseases and Therapies in Clinical Trial Registries
Simona Emilova Doneva | Tilia Ellendorff | Beate Sick | Jean-Philippe Goldman | Amelia Elaine Cannon | Gerold Schneider | Benjamin Victor Ineichen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Extracting and aggregating information from clinical trial registries could provide invaluable insights into the drug development landscape and advance the treatment of neurologic diseases. However, achieving this at scale is hampered by the volume of available data and the lack of an annotated corpus to assist in the development of automation tools. Thus, we introduce NeuroTrialNER, a new and fully open corpus for named entity recognition (NER). It comprises 1093 clinical trial summaries sourced from ClinicalTrials.gov, annotated for neurological diseases, therapeutic interventions, and control treatments. We describe our data collection process and the corpus in detail. We demonstrate its utility for NER using large language models and achieve close-to-human performance. By bridging the gap in data resources, we hope to foster the development of text processing tools that help researchers navigate clinical trial data more easily.
2021
Approaching SMM4H with auto-regressive language models and back-translation
Joseph Cornelius | Tilia Ellendorff | Fabio Rinaldi
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
We describe our submissions to the 6th edition of the Social Media Mining for Health Applications (SMM4H) shared task. Our team (OGNLP) participated in the sub-task: Classification of tweets self-reporting potential cases of COVID-19 (Task 5). For our submissions, we employed systems based on auto-regressive transformer models (XLNet) and back-translation for balancing the dataset.
2020
COVID-19 Twitter Monitor: Aggregating and Visualizing COVID-19 Related Trends in Social Media
Joseph Cornelius | Tilia Ellendorff | Lenz Furrer | Fabio Rinaldi
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task
Social media platforms offer extensive information about the development of the COVID-19 pandemic and the current state of public health. In recent years, the Natural Language Processing community has developed a variety of methods to extract health-related information from posts on social media platforms. For these techniques to be used by a broad public, they must be aggregated and presented in a user-friendly way. We have aggregated ten methods to analyze tweets related to the COVID-19 pandemic, and present interactive visualizations of the results on our online platform, the COVID-19 Twitter Monitor. In the current version of our platform, we offer distinct methods for inspecting the dataset at different levels: corpus-wide, single post, and spans within each post. In addition, we allow different methods to be combined to enable a more selective acquisition of knowledge. Through the visual and interactive combination of various methods, interconnections between the different outputs can be revealed.
2019
Approaching SMM4H with Merged Models and Multi-task Learning
Tilia Ellendorff | Lenz Furrer | Nicola Colic | Noëmi Aepli | Fabio Rinaldi
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task
We describe our submissions to the 4th edition of the Social Media Mining for Health Applications (SMM4H) shared task. Our team (UZH) participated in two sub-tasks: Automatic classifications of adverse effects mentions in tweets (Task 1) and Generalizable identification of personal health experience mentions (Task 4). For our submissions, we exploited ensembles based on a pre-trained language representation with a neural transformer architecture (BERT) (Tasks 1 and 4) and a CNN-BiLSTM(-CRF) network within a multi-task learning scenario (Task 1). These systems are placed on top of a carefully crafted pipeline of domain-specific preprocessing steps.
2018
UZH@SMM4H: System Descriptions
Tilia Ellendorff | Joseph Cornelius | Heath Gordon | Nicola Colic | Fabio Rinaldi
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task
Our team at the University of Zürich participated in the first 3 of the 4 sub-tasks at the Social Media Mining for Health Applications (SMM4H) shared task. We experimented with different approaches for text classification, namely traditional feature-based classifiers (Logistic Regression and Support Vector Machines), shallow neural networks, RCNNs, and CNNs. This system description paper provides details regarding the different system architectures and the achieved results.
2016
The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors
Tilia Ellendorff | Simon Foster | Fabio Rinaldi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present the first version of a corpus annotated for psychiatric disorders and their etiological factors. The paper describes the choice of text, annotated entities and events/relations as well as the annotation scheme and procedure applied. The corpus features a selection of focus psychiatric disorders including depressive disorder, anxiety disorder, obsessive-compulsive disorder, phobic disorders and panic disorder. Etiological factors for these focus disorders are widespread and include genetic, physiological, sociological and environmental factors among others. Etiological events, including annotated evidence text, represent the interactions between the focus disorders and their etiological factors. In addition to these core events, symptomatic and treatment events have been annotated. The current version of the corpus includes 175 scientific abstracts. All entities and events/relations have been manually annotated by domain experts, and scores of inter-annotator agreement are presented. The aim of the corpus is to provide a first gold standard to support the development of biomedical text mining applications for the specific area of mental disorders, which are among the main contributors to the contemporary burden of disease.
2014
Using Large Biomedical Databases as Gold Annotations for Automatic Relation Extraction
Tilia Ellendorff | Fabio Rinaldi | Simon Clematide
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We show how to use large biomedical databases to obtain a gold standard for training a machine learning system over a corpus of biomedical text. As an example, we use the Comparative Toxicogenomics Database (CTD) and describe by means of a short case study how the obtained data can be applied. We explain how we exploit the structure of the database for compiling training material and a test set. Using a Naive Bayes document classification approach based on words, stem bigrams and MeSH descriptors, we achieve a macro-average F-score of 61% on a subset of 8 action terms. This outperforms a baseline system based on a lookup of stemmed keywords by more than 20%. Furthermore, we present directions for future work, taking the described system as a vantage point. Future work will aim towards a weakly supervised system capable of discovering complete biomedical interactions and events.
2013
UZH in BioNLP 2013
Gerold Schneider | Simon Clematide | Tilia Ellendorff | Don Tuggener | Fabio Rinaldi | Gintarė Grigonytė
Proceedings of the BioNLP Shared Task 2013 Workshop