Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations

Sihang Zeng, Zheng Yuan, Sheng Yu


Abstract
Term clustering is important in biomedical knowledge graph construction. Computing similarities between term embeddings is helpful for term clustering. State-of-the-art term embeddings leverage pretrained language models to encode terms and use synonym and relation knowledge from knowledge graphs to guide contrastive learning. These embeddings provide close representations for terms that belong to the same concept. However, our probing experiments show that these embeddings are not sensitive to minor textual differences, which leads to failures in biomedical term clustering. To alleviate this problem, we adjust the sampling strategy in pretraining term embeddings by providing dynamic hard positive and negative samples during contrastive learning, so that the model learns fine-grained representations and yields better biomedical term clustering. We name our proposed method CODER++, and it has been applied to clustering biomedical concepts in the newly released biomedical knowledge graph BIOS.
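The core idea of dynamic hard-pair mining can be sketched as follows. This is not the authors' implementation; it is a minimal illustration, assuming term embeddings as a NumPy array and concept labels per term: for each term, its top-k nearest neighbors by cosine similarity are taken as hard candidates, with same-concept neighbors serving as hard positives and the rest as hard negatives.

```python
import numpy as np

def mine_hard_pairs(embeddings, concept_ids, k=1):
    """Illustrative hard-pair mining (hypothetical helper, not the paper's code).

    For each term, take its k nearest neighbors by cosine similarity;
    neighbors sharing the term's concept id are hard positives (label 1),
    the others are hard negatives (label 0).
    """
    # Row-normalize so the dot product equals cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    pairs = []
    for i in range(len(x)):
        for j in np.argsort(-sim[i])[:k]:  # top-k most similar terms
            label = 1 if concept_ids[i] == concept_ids[j] else 0
            pairs.append((i, int(j), label))
    return pairs

# Toy example: two concepts, two near-duplicate surface forms each.
emb = np.array([[1.0, 0.0], [0.9, 0.1],   # concept A
                [0.0, 1.0], [0.1, 0.9]])  # concept B
pairs = mine_hard_pairs(emb, ["A", "A", "B", "B"], k=1)
```

In practice such mined pairs would feed a contrastive loss, and the mining is repeated as the embeddings change during training, which is what makes the samples "dynamic".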
Anthology ID:
2022.bionlp-1.8
Volume:
Proceedings of the 21st Workshop on Biomedical Language Processing
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Venue:
BioNLP
Publisher:
Association for Computational Linguistics
Pages:
91–96
URL:
https://aclanthology.org/2022.bionlp-1.8
DOI:
10.18653/v1/2022.bionlp-1.8
Cite (ACL):
Sihang Zeng, Zheng Yuan, and Sheng Yu. 2022. Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 91–96, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations (Zeng et al., BioNLP 2022)
PDF:
https://aclanthology.org/2022.bionlp-1.8.pdf
Video:
https://aclanthology.org/2022.bionlp-1.8.mp4
Code:
GanjinZero/CODER
Data:
BC5CDR