Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned

Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma, Scott Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, Svitlana Volkova


Abstract
Foundation models pre-trained on large corpora demonstrate significant gains across many natural language processing tasks and domains, e.g., law, healthcare, and education. However, only limited efforts have investigated the opportunities and limitations of applying these powerful models to science and security applications. In this work, we develop foundation models of scientific knowledge for chemistry to augment scientists with an advanced ability to perceive and reason at a previously unattainable scale. Specifically, we build large-scale (1.47B-parameter) general-purpose models for chemistry that can be effectively used to perform a wide range of in-domain and out-of-domain tasks. Evaluating these models in a zero-shot setting, we analyze the effect of model and data scaling, knowledge depth, and temporality on model performance in the context of model training efficiency. Our novel findings demonstrate that (1) model size contributes significantly to task performance when evaluated in a zero-shot setting; (2) data quality (i.e., diversity) affects model performance more than data quantity; (3) unlike in previous work, the temporal order of documents in the corpus boosts model performance only for specific tasks, e.g., SciQ; and (4) models pre-trained from scratch perform better on in-domain tasks than those fine-tuned from general-purpose models such as OpenAI's GPT-2.
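The abstract evaluates the chemistry models in a zero-shot setting on multiple-choice science QA benchmarks such as SciQ. As a rough illustration of what zero-shot multiple-choice evaluation of a GPT-2-style causal language model can look like, the sketch below scores each answer option by the log-likelihood the model assigns to it and picks the highest-scoring one. The model name, prompt template, and toy question are assumptions for illustration only; they are not the paper's actual evaluation harness or its 1.47B-parameter chemistry checkpoints.

```python
# Minimal zero-shot multiple-choice scoring sketch (illustrative only).
# Assumptions: a GPT-2-style causal LM from Hugging Face, a simple
# "Question: ... Answer: ..." prompt, and log-likelihood scoring of
# answer options; the paper's prompts, models, and harness may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a chemistry-domain checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on the question."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities at each position for predicting the *next* token.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions corresponding to the answer tokens
    # (approximation: assumes the prompt tokenization is a prefix of the full sequence).
    answer_start = prompt_ids.shape[1] - 1
    return token_lp[0, answer_start:].sum().item()

question = "What is the lightest chemical element?"  # toy SciQ-style example
options = ["hydrogen", "helium", "oxygen", "carbon"]
scores = {opt: option_logprob(question, opt) for opt in options}
print(max(scores, key=scores.get))  # prediction = highest-scoring option
```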
Anthology ID:
2022.bigscience-1.12
Original:
2022.bigscience-1.12v1
Version 2:
2022.bigscience-1.12v2
Volume:
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
Month:
May
Year:
2022
Address:
virtual+Dublin
Editors:
Angela Fan, Suzana Ilic, Thomas Wolf, Matthias Gallé
Venue:
BigScience
Publisher:
Association for Computational Linguistics
Pages:
160–172
URL:
https://aclanthology.org/2022.bigscience-1.12
DOI:
10.18653/v1/2022.bigscience-1.12
Cite (ACL):
Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma, Scott Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, and Svitlana Volkova. 2022. Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned. In Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models, pages 160–172, virtual+Dublin. Association for Computational Linguistics.
Cite (Informal):
Foundation Models of Scientific Knowledge for Chemistry: Opportunities, Challenges and Lessons Learned (Horawalavithana et al., BigScience 2022)
PDF:
https://aclanthology.org/2022.bigscience-1.12.pdf
Video:
https://aclanthology.org/2022.bigscience-1.12.mp4
Code:
eleutherai/gpt-neox
Data:
BLUE, BoolQ, CORD-19, LAMBADA, MathQA, OpenBookQA, PIQA, PubMedQA, S2ORC, SciQ, The Pile, WSC, WebText, WiC