CoSAEmb: Contrastive Section-aware Aspect Embeddings for Scientific Articles

Shruti Singh, Mayank Singh


Abstract
Research papers are long documents that cover several aspects, such as background, prior work, methodology, and results. Existing work on scientific document representation learning leverages only the title and abstract of a paper. We present CoSAEmb, a model that learns representations from the full text of 97,402 scientific papers from the S2ORC dataset. We present a novel supervised contrastive training framework for long documents using triplet loss and margin gradation. Our framework can be used to learn representations of long documents with any existing encoder-only transformer model without retraining it from scratch. CoSAEmb improves performance on information retrieval from a paper’s full text compared with models trained only on paper titles and abstracts. We also evaluate CoSAEmb on the SciRepEval and CSFCube benchmarks, where it performs comparably to existing state-of-the-art models.
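The abstract names triplet loss and margin gradation without defining them here. As a rough illustration only, the sketch below computes the standard triplet loss, max(0, d(a, p) - d(a, n) + margin), and lets the margin depend on a grade assigned to the negative example. The grade names, the margin values, and the function `graded_triplet_loss` are hypothetical and are not taken from the paper, which may define margin gradation differently.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


# Hypothetical margin schedule (an assumption for illustration, not the
# paper's values): each grade of negative gets its own required margin.
HYPOTHETICAL_MARGINS = {"hard": 0.5, "easy": 2.5}


def graded_triplet_loss(anchor, positive, negative, grade,
                        margins=HYPOTHETICAL_MARGINS):
    """Triplet loss whose margin is looked up from the negative's grade."""
    return triplet_loss(anchor, positive, negative, margins[grade])


# Toy 2-d embeddings: d(anchor, positive) = 1, d(anchor, negative) = 3.
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [0.0, 3.0]
loss_hard = graded_triplet_loss(anchor, positive, negative, "hard")
loss_easy = graded_triplet_loss(anchor, positive, negative, "easy")
```

With these toy vectors, the smaller margin is already satisfied and contributes no loss, while the larger margin still penalizes the triplet. In practice the vectors would come from a transformer encoder rather than being written by hand.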
Anthology ID:
2024.sdp-1.27
Volume:
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Tirthankar Ghosal, Amanpreet Singh, Anita Waard, Philipp Mayr, Aakanksha Naik, Orion Weller, Yoonjoo Lee, Shannon Shen, Yanxia Qin
Venues:
sdp | WS
Publisher:
Association for Computational Linguistics
Pages:
283–292
URL:
https://aclanthology.org/2024.sdp-1.27
Cite (ACL):
Shruti Singh and Mayank Singh. 2024. CoSAEmb: Contrastive Section-aware Aspect Embeddings for Scientific Articles. In Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024), pages 283–292, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
CoSAEmb: Contrastive Section-aware Aspect Embeddings for Scientific Articles (Singh & Singh, sdp-WS 2024)
PDF:
https://aclanthology.org/2024.sdp-1.27.pdf