The Diminishing Returns of Masked Language Models to Science

Transformer-based masked language models such as BERT, trained on general corpora, have shown impressive performance on downstream tasks. It has also been demonstrated that the downstream task performance of such models can be improved by pretraining larger models for longer on more data. In this work, we empirically evaluate the extent to which these results extend to tasks in science. We use 14 domain-specific transformer-based models (including ScholarBERT, a new 770M-parameter science-focused masked language model pretrained on up to 225B tokens) to evaluate the impact of training data, model size, and pretraining and finetuning time on 12 downstream scientific tasks. Interestingly, we find that increasing model size, training data, or compute time does not always lead to significant improvements (i.e., >1% F1), if at all, on scientific information extraction tasks, and we offer possible explanations for these surprising performance differences.


Introduction
Massive growth in the number of scientific publications places considerable cognitive burden on researchers (Teplitskiy et al., 2022). Language models can potentially alleviate this burden by automating the scientific knowledge extraction process. BERT (Devlin et al., 2019) was pretrained on a general corpus (BooksCorpus and Wikipedia) that differs from scientific literature in terms of context, terminology, and writing style (Ahmad, 2012). Other masked language models have since been pretrained on domain-specific scientific corpora (Huang and Cole, 2022; Gu et al., 2021; Gururangan et al., 2020; Beltagy et al., 2019) with the goal of improving downstream task performance. (We use the term domain to indicate a specific scientific discipline, such as biomedical science or computer science.) Other studies (Liu et al., 2019; Kaplan et al., 2020) explored the impact of varying model size, training corpus size, and compute time on downstream task performance. However, no previous work has investigated how these parameters affect science-focused models.
In this study, we train a set of scientific language models of different sizes, collectively called SCHOLARBERT, on a large, multidisciplinary scientific corpus of 225B tokens to understand the effects of model size, data size, and compute time (specifically, pretraining and finetuning epochs) on downstream task performance. We find that for information extraction tasks, an important application for scientific language models, the performance gains from training a larger model for longer with more data are not robust: they are highly task-dependent. We make the SCHOLARBERT models and a sample of the training corpus publicly available to encourage further studies.

Related Work
Prior research has explored the effects of varying model size, dataset size, and amount of compute on language model performance. Kaplan et al. (2020) demonstrated that cross-entropy training loss scales as a power law with model size, dataset size, and compute time for unidirectional decoder-only architectures. Brown et al. (2020) showed that language model few-shot learning abilities can be improved by using larger models. However, both studies explored only the Generative Pretrained Transformer (GPT), an autoregressive generative model (Brown et al., 2020).
Comparing BERT-Base (110M parameters) and BERT-Large (340M parameters), Devlin et al. (2019) showed that masked language models can also benefit from more parameters. Likewise, Liu et al. (2019) demonstrated how BERT models can benefit from training for longer periods, with bigger batches, and with more data.
Models such as BERT and RoBERTa were pretrained on general corpora. To boost performance on scientific downstream tasks, SciBERT (Beltagy et al., 2019), PubMedBERT (Gu et al., 2021), BioBERT (Lee et al., 2020), and MatBERT (Trewartha et al., 2022) were trained on domain-specific text, with the goal of enhancing performance on tasks requiring domain knowledge. Yet no work has examined how that task performance varies with pretraining parameters.

Data and Methodology
We outline the pretraining dataset, related models to which we compare performance, and the architecture and pretraining process used for creating the SCHOLARBERT models.

The Public Resource Dataset
We pretrain the SCHOLARBERT models on a dataset provided by Public.Resource.Org, Inc. ("Public Resource"), a nonprofit organization based in California. This dataset was constructed from a corpus of 85M journal article PDF files, from which text was extracted with version 0.5.5 of the Grobid tool (GROBID). Not all extractions were successful, because of corrupted or badly encoded PDF files. We work here with text from ∼75M articles in this dataset, categorized as 45.3% biomedicine, 23.1% technology, 20.0% physical sciences, 8.4% social sciences, and 3.1% arts & humanities. (A sample of the extracted texts and corresponding original PDFs is available in the Data attachment for review purposes.)

Models
We consider 14 BERT models: seven from the existing literature (BERT-Base, BERT-Large, SciBERT, PubMedBERT, BioBERT v1.2, MatBERT, and BatteryBERT: Appendix A); and seven SCHOLARBERT variants pretrained on different subsets of the Public Resource dataset (and, in some cases, also the WikiBooks corpus). We distinguish these models along the four dimensions listed in Table 1: architecture, pretraining method, pretraining corpus, and casing. SCHOLARBERT and SCHOLARBERT-XL, with 340M and 770M parameters, respectively, are the largest science-specific BERT models reported to date. Prior literature demonstrates the efficacy of pretraining BERT models on domain-specific corpora (Sun et al., 2019; Fabien et al., 2020). However, the ever-larger scientific literature makes pretraining domain-specific language models prohibitively expensive. A promising alternative is to create larger, multi-disciplinary BERT models, such as SCHOLARBERT, that harness the increased availability of diverse pretraining text; researchers can then adapt (i.e., finetune) these general-purpose science models to meet their specific needs.

SCHOLARBERT Pretraining
We randomly sample 1%, 10%, and 100% of the Public Resource dataset to create PRD_1, PRD_10, and PRD_100. We pretrain SCHOLARBERT models on these PRD subsets by using the RoBERTa pretraining procedure, which has been shown to produce better downstream task performance in a variety of domains (Liu et al., 2019). See Appendix B.2 for details.

Experimental Results
We first perform sensitivity analysis across SCHOLARBERT pretraining dimensions to determine the trade-off between time spent in pretraining versus finetuning. We also compare the downstream task performance of SCHOLARBERT to that achieved with other BERT models. Details of each evaluation task are in Appendix C.

Sensitivity Analysis
We save checkpoints periodically while pretraining each SCHOLARBERT(-XL) model. In this analysis, we checkpoint at ∼0.9k, 5k, 10k, 23k, and 33k iterations, based on the decrease of training loss between iterations. We observe that pretraining loss decreases rapidly until around 10 000 iterations, and that further training to convergence (roughly 33 000 iterations) yields only small decreases of training loss: see Figure 1 in the Appendix.
To measure how downstream task performance is impacted by pretraining and finetuning time, we finetune each of the checkpointed models for 5 and 75 epochs. We observe that: (1) The undertrained 0.9k-iteration model sees the biggest boost in the F1 scores of downstream tasks (+8%) with more finetuning, but even with 75 epochs of finetuning, the 0.9k-iteration model's average F1 score is still 19.9 percentage points below that of the 33k-iteration model with 5 epochs of finetuning.
(2) For subsequent checkpoints, the performance gains from more finetuning decrease as the number of pretraining iterations increases. The average downstream task performance of the 33k-iteration model is only 0.39 percentage points higher with 75 epochs of finetuning than with 5 epochs. Therefore, in the remaining experiments, we use the SCHOLARBERT(-XL) model that was pretrained for 33k iterations and finetuned for 5 epochs.

Finetuning
We finetuned the SCHOLARBERT models and the state-of-the-art scientific models listed in Table 1 on NER, relation extraction, and sentence classification tasks. F1 scores for each model-task pair, averaged over five runs, are shown in Tables 2 and 3.
For NER tasks, we use the CoNLL NER evaluation Perl script (Sang and De Meulder, 2003) to compute F1 scores for each test.
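The CoNLL script scores at the entity level: a predicted entity counts as correct only if both its span and its type exactly match a gold entity. A minimal Python sketch of this metric (not the Perl script itself; the BIO-tag handling is simplified, e.g. stray I- tags without a preceding B- are ignored):

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO-tagged sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last entity
        # Flush the open entity on B-, O, or a type-mismatched I- tag.
        if tag.startswith("B-") or tag == "O" or (etype and tag != "I-" + etype):
            if start is not None:
                entities.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return set(entities)

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1, as in CoNLL-style NER evaluation."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Note that exact span matching gives no credit for partially overlapping entities, which is one reason entity-level F1 differences of a point or less are hard to interpret.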
Tables 2 and 3 show the results, from which we observe: (1) With the same training data, a larger model does not always achieve significant performance improvements. BERT-Base achieved F1 scores within 1 percentage point of BERT-Large on 6/12 tasks; SB_1 achieved F1 scores within 1 percentage point of SB-XL_1 on 7/12 tasks; SB_100 achieved F1 scores within 1 percentage point of SB-XL_100 on 6/12 tasks. (2) With the same model size, pretraining on more data does not guarantee significant performance improvements. SB_1 achieved F1 scores within 1 percentage point of SB_100 on 8/12 tasks; SB_10_WB achieved F1 scores within 1 percentage point of SB_100_WB on 7/12 tasks; SB-XL_1 achieved F1 scores within 1 percentage point of SB-XL_100 on 10/12 tasks. (3) Domain-specific pretraining does not guarantee significant performance improvements. Biomedicine is the only domain in which the on-domain model (i.e., the model pretrained for the associated domain, marked with underlines; in this case, PubMedBERT) consistently outperformed models pretrained on off-domain or more general corpora by more than 1 percentage point in F1. The same cannot be said for the CS, Materials, or Multi-Domain tasks.

Discussion
Here we offer possible explanations for the three observations above. (1) The nature of the task is more indicative of task performance than the size of the model. In particular, with the same training data, a larger model size impacts performance only for relation extraction tasks, which consistently saw F1 scores increase by more than 1 percentage point when going from smaller to larger models (i.e., BERT-Base to BERT-Large, SB_1 to SB-XL_1, SB_100 to SB-XL_100). In contrast, the NER and sentence classification tasks did not see such consistent significant improvements. (2) Our biggest model, SCHOLARBERT-XL, is only twice as large as the original BERT-Large, but its pretraining corpus is 100× larger. The training loss of the SCHOLARBERT-XL_100 model dropped rapidly only in the first ∼10k iterations (Fig. 1 in the Appendix), which covered the first 1/3 of the PRD corpus; it is thus possible that the PRD corpus saturates even our biggest model (Kaplan et al., 2020; Hoffmann et al., 2022). (3) Finetuning can compensate for missing domain-specific knowledge in pretraining data. While pretraining language models on a specific domain can help them learn domain-specific concepts, finetuning can also fill holes in the pretraining corpus's domain knowledge, as long as the pretraining corpus incorporates the characteristics specific to the finetuning dataset.

Conclusions
We have reported experiments that compare and evaluate the impact of various parameters (model size, pretraining dataset size and breadth, and pretraining and finetuning lengths) on the performance of different language models pretrained on scientific literature. Our results encompass 14 existing and newly developed BERT-based language models across 12 scientific downstream tasks. We find that model performance on downstream scientific information extraction tasks is not improved significantly or consistently by increasing any of the four parameters considered (model size, amount of pretraining data, pretraining time, finetuning time). We attribute these results to both the power of finetuning and limitations in the evaluation datasets, as well as (for the SCHOLARBERT models) small model sizes relative to the large pretraining corpus.
We make the ScholarBERT models available on HuggingFace (https://huggingface.co/globuslabs). While we cannot share the full Public Resource dataset, we have provided a sample of open-access articles from the dataset (https://github.com/tuhz/PublicResourceDatasetSample) in both the original PDF and extracted text formats to illustrate the quality of the PDF-to-text preprocessing.

Limitations
Our 12 labeled test datasets are from just five domains (plus two multi-disciplinary); five of the 12 are from biomedicine. This imbalance, which reflects the varied adoption of NLP methods across domains, means that our evaluation dataset is necessarily limited. Our largest model, with 770M parameters, may not be sufficiently large to demonstrate scaling laws for language models. We also aim to extend our experiments to tasks other than NER, relation extraction, and text classification, such as question answering and textual entailment in scientific domains.
A Extant BERT-based Models

Devlin et al. (2019) introduced BERT-Base and BERT-Large, with ∼110M and ∼340M parameters, respectively, as transformer-based masked language models conditioned on both the left and right contexts. Both are pretrained on the English Wikipedia + BooksCorpus datasets. SciBERT (Beltagy et al., 2019) follows the BERT-Base architecture and is pretrained on data from two domains, namely, biomedical science and computer science. SciBERT outperforms BERT-Base on finetuning tasks by an average of 1.66% on biomedical tasks and 3.55% on computer science tasks.
BioBERT (Lee et al., 2020) is a BERT-Base model with a pretraining corpus drawn from PubMed abstracts and full-text PubMedCentral articles. Compared to BERT-Base, BioBERT achieves improvements of 0.62%, 2.80%, and 12.24% on biomedical NER, biomedical relation extraction, and biomedical question answering, respectively.
PubMedBERT (Gu et al., 2021), another BERT-Base model targeting the biomedical domain, is also pretrained on PubMed and PubMedCentral text. However, unlike BioBERT, PubMedBERT is trained from scratch as a new BERT-Base model, using text drawn exclusively from PubMed and PubMedCentral. As a result, the vocabulary used in PubMedBERT varies significantly from that used in BERT and BioBERT. Its pretraining corpus contains 3.1B words from PubMed abstracts and 13.7B words from PubMedCentral articles. PubMedBERT achieves state-of-the-art performance on the Biomedical Language Understanding and Reasoning Benchmark, outperforming BERT-Base by 1.16% (Gu et al., 2021).
MatBERT (Trewartha et al., 2022) is a materials-science-specific model pretrained on 2M journal articles (8.8B tokens). It consistently outperforms BERT-Base and SciBERT in recognizing materials science entities related to solid states, doped materials, and gold nanoparticles, with a ∼10% increase in F1 score compared to BERT-Base and a 1% to 2% improvement compared to SciBERT.
BatteryBERT (Huang and Cole, 2022) is a model pretrained on 400 366 battery-related publications (5.2B tokens). BatteryBERT has been shown to outperform BERT-Base by less than 1% on the SQuAD question answering task. For battery-specific question-answering tasks, its F1 score is around 5% higher than that of BERT-Base.

B ScholarBERT Pretraining Details

B.1 Tokenization
The vocabularies generated for PRD_1 and PRD_10 differed only in 1-2% of tokens; however, in an initial study, the PRD_100 vocabulary differed from that of PRD_10 by 15%. A manual inspection of the PRD_100 vocabulary revealed that many common English words, such as "is," "for," and "the," were missing. We determined that these omissions were an artifact of PRD_100 being sufficiently large to cause integer overflows in the unsigned 32-bit-integer token frequency counts used by HuggingFace's tokenizers library. For example, "the" was not in the final vocabulary because the count for the token "th" overflowed. Because WordPiece iteratively merges smaller tokens to create larger ones, the absence of tokens like "th" or "##he" means that "the" could not appear in the final vocabulary.
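The failure mode is easy to reproduce with modular arithmetic: an unsigned 32-bit counter wraps around at 2^32, so an extremely frequent token can end up with a small apparent count and fall below any frequency cutoff. A toy illustration (the count used for "th" here is illustrative, not a measured figure from the corpus):

```python
UINT32_WRAP = 2**32  # unsigned 32-bit counters wrap at this value

def wrapped_count(true_count):
    """Simulate an unsigned 32-bit frequency counter overflowing."""
    return true_count % UINT32_WRAP

# Suppose the subword "th" occurs ~4.5 billion times in a 225B-token corpus
# (a hypothetical figure chosen only to exceed 2**32).
true_count = 4_500_000_000
apparent = wrapped_count(true_count)
print(apparent)  # 205032704: far smaller than the true count
```

With so small an apparent count, "th" is pruned from the vocabulary; and because WordPiece builds longer tokens by merging shorter ones, every token that would have been built from "th" disappears with it.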
We modified the tokenizers library to use unsigned 64-bit integers for all frequency counts and recreated a correct vocabulary for PRD_100. Interestingly, models trained on the PRD_100 subset with the incorrect and correct vocabularies exhibited comparable performance on downstream tasks.

B.2 RoBERTa Optimizations
RoBERTa introduces many optimizations that improve BERT pretraining performance (Liu et al., 2019):
1) It uses a single-phase training approach in which all training is performed with a maximum sequence length of 512.
2) Unlike BERT, which randomly introduces a small percentage of shortened sequence lengths into the training data, RoBERTa does not randomly use shortened sequences.
3) RoBERTa uses dynamic masking: each time a batch of training samples is selected at runtime, a new random set of masked tokens is chosen. In contrast, BERT uses static masking, pre-masking the training samples prior to training; BERT duplicates the training data 10 times, each copy with a different random static masking.
4) RoBERTa does not perform Next Sentence Prediction during training.
5) RoBERTa takes sentences contiguously from one or more documents until the maximum sequence length is met.
6) RoBERTa uses a larger batch size of 8192.
7) RoBERTa uses byte-pair encoding (BPE) rather than WordPiece.
8) RoBERTa uses an increased vocabulary size of 50 000, 67% larger than BERT's.
9) RoBERTa trains for more iterations (up to 500 000) than does BERT-Base (31 000).
We adopt the RoBERTa training methods, with three key exceptions:
1) Unlike RoBERTa, we randomly introduce smaller-length samples, because many of our downstream tasks use sequence lengths much smaller than the maximum sequence length of 512 with which we pretrain.
2) We pack training samples with sentences drawn from a single document, as the RoBERTa authors note that this results in slightly better performance.
3) We use WordPiece encoding rather than BPE, as the RoBERTa authors note that BPE can result in slightly worse downstream performance.
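Dynamic masking (point 3 above) can be sketched as follows: each time a sample is drawn, a fresh set of positions is masked, with the standard BERT 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged). The function name, token IDs, and masking rate default are illustrative, though 15% is the rate BERT uses:

```python
import random

MASK_ID = 103  # illustrative [MASK] token id

def dynamically_mask(token_ids, vocab_size, mask_prob=0.15, rng=random):
    """Return (masked_ids, labels); labels are -100 at unmasked positions."""
    masked, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID                     # 80%: [MASK]
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)   # 10%: random token
            # else 10%: keep the original token
    return masked, labels
```

Because the function is called at batch-construction time, the same sentence receives a different mask on every epoch; static masking would instead fix `masked` once before training, which BERT approximates by duplicating the data 10 times with different static masks.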

B.3 Hardware and Software Stack
We perform data-parallel pretraining on a cluster with 24 nodes, each containing eight 40 GB NVIDIA A100 GPUs. In data-parallel distributed training, a copy of the model is replicated on each GPU, and, in each iteration, each GPU computes on a unique local mini-batch. At the end of the iteration, the local gradients of each model replica are averaged to keep the replicas in sync. We perform data-parallel training of SCHOLARBERT models using PyTorch's distributed data-parallel model wrapper and 16 A100 GPUs. For the larger SCHOLARBERT-XL models, we use the DeepSpeed data-parallel model wrapper and 32 A100 GPUs. The DeepSpeed library incorporates a number of optimizations that improve training time and reduce memory usage, enabling us to train the larger model in roughly the same amount of time as the smaller model.
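The gradient-averaging step that keeps replicas in sync can be illustrated with a pure-Python sketch. In practice this is an allreduce over GPU tensors performed by the DDP wrapper; the list-of-lists representation below is only a stand-in:

```python
def allreduce_average(local_grads):
    """Average per-replica gradients so every replica applies the same update.

    local_grads: one flat gradient list per model replica
    (one entry per replica, one float per parameter).
    """
    n_replicas = len(local_grads)
    return [sum(g) / n_replicas for g in zip(*local_grads)]

# Two replicas, each with gradients computed on its own local mini-batch:
replica_grads = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(allreduce_average(replica_grads))  # [2.0, 3.0, 4.0]
```

Averaging the per-replica gradients is mathematically equivalent to computing the gradient of one large batch spanning all replicas, which is why data parallelism preserves the effective batch size described below.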
We train in FP16 with a batch size of 32 768 for ∼33 000 iterations (Table 5). To achieve training with larger batch sizes, we employ NVIDIA Apex's FusedLAMB (NVIDIA, 2017) optimizer, with an initial learning rate of 0.0004. The learning rate is warmed up for the first 6% of iterations and then linearly decayed for the remaining iterations. We use the same masked token percentages as are used for BERT. Training each model requires roughly 1000 node-hours, or 8000 GPU-hours.
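The schedule just described, linear warmup over the first 6% of iterations followed by linear decay to zero, can be written directly (the function name is ours; the defaults match the hyperparameters above):

```python
def learning_rate(step, total_steps=33_000, peak_lr=4e-4, warmup_frac=0.06):
    """Linear warmup for the first warmup_frac of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps           # ramp 0 -> peak_lr
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

The learning rate peaks at exactly the warmup boundary (step 1980 with these defaults) and reaches zero at the final iteration.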
Figure 1 depicts the pretraining loss for each SCHOLARBERT model. We train each model past the point of convergence and take checkpoints throughout training to evaluate model performance as a function of training time.

C Evaluation Tasks
We evaluate the models on eight NER tasks and four sentence-level tasks. For the NER tasks, we use eight annotated scientific NER datasets: 1. BC5CDR (Li et al., 2016)
The sentence-level downstream tasks are relation extraction on the ChemProt (biology) and SciERC (computer science) datasets, and sentence classification on the Paper Field (multidisciplinary) and Battery (materials) datasets: 1. ChemProt consists of 1820 PubMed abstracts with chemical-protein interactions annotated by domain experts (Peng et al., 2019).
3. The Paper Field dataset (Beltagy et al., 2019), built from the Microsoft Academic Graph (Sinha et al., 2015), maps paper titles to one of seven fields of study (geography, politics, economics, business, sociology, medicine, and psychology), with each field of study having around 12K training examples.
4. The Battery Document Classification dataset (Huang and Cole, 2022) includes 46 663 paper abstracts, of which 29 472 are labeled as battery and the other 17 191 as non-battery.The labeling is performed in a semi-automated manner.Abstracts are selected from 14 battery journals and 1044 non-battery journals, with the former labeled "battery" and the latter "non-battery."

D Extended Results
Table 6 shows average F1 scores with standard deviations for the NER tasks, each computed over five runs; Figure 2 presents the same data, with standard deviations represented by error bars. Table 7 and Figure 3 show the same for sentence classification tasks. The significant overlaps of error bars for NCBI-Disease, SciERC NER, Coleridge, SciERC Sentence Classification, and ChemProt corroborate our observation in Section 4 that on-domain pretraining provides only marginal advantage for downstream prediction over pretraining on a different domain or a general corpus.
Table 6: NER F1 scores for each of 14 models (rows), when the model is finetuned on eight different domain datasets and the resulting finetuned model applied to that dataset's associated NER task (columns). In each case, we give the average value and its standard deviation over five runs.
Figure 2: NER F1 scores from Table 6, with standard deviations represented by error bars.
Table 7: Sentence classification F1 scores for each of 14 models (rows), when the model is finetuned on one of four different domain datasets and the finetuned model is applied to that dataset's associated sentence classification task (columns). In each case, we give the average value and its standard deviation over five runs.

Figure 1: Pretraining loss plots for the SCHOLARBERT models listed in Table 1. The vertical dashed lines indicate the approximate locations of the iteration checkpoints selected for evaluation in Section 4.1.

Figure 3: Sentence classification F1 scores from Table 7, with standard deviations represented by error bars.

Table 1: The 14 BERT models considered in this study. The BERT-Base and -Large architectures are described in (Devlin et al., 2019); the BERT-XL architecture has 36 layers, a hidden size of 1280, and 20 heads. Details of the pretraining corpora are in Table 4 in the Appendix. The domains are Bio=biomedicine, CS=computer science, Gen=general, Mat=materials science and engineering, and Sci=broad scientific.

Table 2: NER F1 scores for each model. Models are finetuned five times for each dataset and the average result is presented. Underlined results represent the F1 scores of models trained on in-distribution data for the given task, and bolded results indicate the best-performing model on that task. SB = SCHOLARBERT.

Table 3: F1 scores for each model on Relation Extraction (SciERC, ChemProt) and Sentence Classification (PaperField, Battery) tasks. Models are finetuned five times for each dataset and the average result is presented. Underlined results represent the F1 scores of models trained on in-distribution data for the given task, and bolded results indicate the best-performing model on that task. SB = SCHOLARBERT.

Table 4: Pretraining corpora used by models in this study. The domains are Bio=biomedicine, CS=computer science, Gen=general, Mat=materials science and engineering, and Sci=broad scientific.