MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction

Saadullah Amin, Pasquale Minervini, David Chang, Pontus Stenetorp, Guenter Neumann


Abstract
Relation extraction in the biomedical domain is challenging because labeled data is scarce and annotation is costly, requiring domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw text. Such a pipeline is prone to noise and is harder to scale when covering a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation extraction benchmarks and found a significant overlap between training and test relationships, ranging from 26% to 86%. Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no train-test leakage, the focus is on interactions between narrower entity types. This work presents a more accurate benchmark, MedDistant19, for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is obtained by aligning MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. Because thorough evaluation with domain-specific language models has been lacking, we also conduct experiments validating general-domain relation extraction findings on biomedical relation extraction.
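The train-test overlap mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a relationship is a (head, relation, tail) triple and that overlap is the fraction of distinct test triples that also occur in training; the abstract does not spell out the exact definition, and the function name `triple_overlap` and the toy triples below are hypothetical, not taken from the benchmark.

```python
def triple_overlap(train_triples, test_triples):
    """Return the fraction of distinct test triples that also occur in training."""
    train_set = set(train_triples)
    test_set = set(test_triples)
    if not test_set:
        return 0.0  # no test triples, so nothing can leak
    return len(test_set & train_set) / len(test_set)


# Toy (head, relation, tail) triples, invented for illustration only.
train = [
    ("aspirin", "may_treat", "headache"),
    ("insulin", "may_treat", "diabetes"),
]
test = [
    ("aspirin", "may_treat", "headache"),   # also present in training: leaked
    ("ibuprofen", "may_treat", "fever"),    # unseen in training
]

overlap = triple_overlap(train, test)
print(f"{overlap:.0%} of test triples also occur in training")
```

A value near zero would indicate that test relationships are unseen during training, which is the setting a leakage-free benchmark aims for.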
Anthology ID:
2022.coling-1.198
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
2259–2277
URL:
https://aclanthology.org/2022.coling-1.198
Cite (ACL):
Saadullah Amin, Pasquale Minervini, David Chang, Pontus Stenetorp, and Guenter Neumann. 2022. MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2259–2277, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction (Amin et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.198.pdf
Code
 suamin/meddistant19
Data
PubmedUMLS