Identifying Key Sentences for Precision Oncology Using Semi-Supervised Learning

Jurica Ševa, Martin Wackerbauer, Ulf Leser


Abstract
We present a machine learning pipeline that identifies key sentences in abstracts of oncological articles to aid evidence-based medicine. This problem is characterized by the lack of gold standard datasets, data imbalance and thematic differences between available silver standard corpora. Additionally, available training and target data differ with regard to their domain (professional summaries vs. sentences in abstracts). This makes supervised machine learning inapplicable. We propose the use of two semi-supervised machine learning approaches: To mitigate difficulties arising from heterogeneous data sources, overcome data imbalance and create reliable training data, we propose using transductive learning from positive and unlabelled data (PU Learning). For obtaining a realistic classification model, we propose the use of abstracts summarised in relevant sentences as unlabelled examples through Self-Training. The best model achieves 84% accuracy and 0.84 F1 score on our dataset.
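The Self-Training idea mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' pipeline and is not taken from the nachne/semisuper code: the nearest-centroid bag-of-words classifier is a stand-in chosen to keep the example dependency-free, and all names (vectorize, fit, predict, self_train), the margin threshold and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(sentence):
    """Bag-of-words token counts for one sentence."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit(labelled):
    """Sum token counts per label to obtain one centroid per class."""
    centroids = {}
    for sentence, label in labelled:
        centroids.setdefault(label, Counter()).update(vectorize(sentence))
    return centroids


def predict(centroids, sentence):
    """Return (label, margin); margin is best minus runner-up similarity."""
    vec = vectorize(sentence)
    scores = sorted(
        ((cosine(vec, c), label) for label, c in centroids.items()),
        reverse=True,
    )
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


def self_train(labelled, unlabelled, threshold=0.2, max_rounds=5):
    """Repeatedly move confidently predicted sentences into the training set."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        model = fit(labelled)
        confident, remaining = [], []
        for sentence in pool:
            label, margin = predict(model, sentence)
            if margin >= threshold:
                confident.append((sentence, label))  # adopt the pseudo-label
            else:
                remaining.append(sentence)  # stay unlabelled for now
        if not confident:
            break  # nothing new to learn from
        labelled.extend(confident)
        pool = remaining
    return fit(labelled), labelled


# Toy data (hypothetical): one seed sentence per class plus an unlabelled pool.
seed = [
    ("mutation confers resistance to therapy", "key"),
    ("patients were enrolled at two sites", "other"),
]
pool = [
    "this mutation predicts resistance",
    "patients were recruited at sites",
]
model, grown = self_train(seed, pool)
```

The threshold governs how aggressively pseudo-labels are adopted: a higher value adds fewer but more reliable examples per round, a lower value grows the training set faster at the risk of reinforcing early mistakes.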
Anthology ID: W18-2305
Volume: Proceedings of the BioNLP 2018 workshop
Month: July
Year: 2018
Address: Melbourne, Australia
Venues: ACL | BioNLP | WS
Publisher: Association for Computational Linguistics
Pages: 35–46
URL: https://aclanthology.org/W18-2305
DOI: 10.18653/v1/W18-2305
Cite (ACL): Jurica Ševa, Martin Wackerbauer, and Ulf Leser. 2018. Identifying Key Sentences for Precision Oncology Using Semi-Supervised Learning. In Proceedings of the BioNLP 2018 workshop, pages 35–46, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal): Identifying Key Sentences for Precision Oncology Using Semi-Supervised Learning (Ševa et al., 2018)
PDF: https://aclanthology.org/W18-2305.pdf
Code: nachne/semisuper
Data: HOC