SuMe: A Dataset Towards Summarizing Biomedical Mechanisms

Mohaddeseh Bastan, Nishant Shankar, Mihai Surdeanu, Niranjan Balasubramanian


Abstract
Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 22k instances. We also introduce conclusion sentence generation as a pretraining task with 611k instances. We benchmark the performance of large bio-domain language models. We find that while the pretraining task helps improve performance, the best model produces acceptable mechanism outputs in only 32% of the instances, which shows the task presents significant challenges in biomedical language understanding and summarization.
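The input/output structure described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name, field names, and the flattened text layout are hypothetical and are not taken from the released SuMe data or code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MechanismInstance:
    """One hypothetical task instance; field names are illustrative, not the dataset's schema."""
    supporting_sentences: List[str]  # input: sentences taken from an abstract
    entity_a: str                    # input: first main entity (e.g., a protein or a chemical)
    entity_b: str                    # input: second main entity
    relation: str                    # output: the relationship between the two entities
    mechanism_sentence: str          # output: sentence summarizing the mechanism


def build_source_text(instance: MechanismInstance, sep: str = " ") -> str:
    """Flatten the input side into one string for a sequence-to-sequence model.

    The "entities: ... | text: ..." layout is an assumption made for this sketch.
    """
    context = sep.join(instance.supporting_sentences)
    return f"entities: {instance.entity_a}; {instance.entity_b} | text: {context}"


def build_target_text(instance: MechanismInstance) -> str:
    """Flatten the output side: the relation followed by the mechanism sentence."""
    return f"{instance.relation} | {instance.mechanism_sentence}"
```

Keeping the relation and the mechanism sentence in a single target string mirrors the abstract's statement that the output includes both; a real pipeline may well separate or encode them differently.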
Anthology ID:
2022.lrec-1.748
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6922–6931
URL:
https://aclanthology.org/2022.lrec-1.748
Cite (ACL):
Mohaddeseh Bastan, Nishant Shankar, Mihai Surdeanu, and Niranjan Balasubramanian. 2022. SuMe: A Dataset Towards Summarizing Biomedical Mechanisms. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6922–6931, Marseille, France. European Language Resources Association.
Cite (Informal):
SuMe: A Dataset Towards Summarizing Biomedical Mechanisms (Bastan et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.748.pdf
Code:
StonyBrookNLP/SuMe
Data:
SuMe, BLUE