Pre-train or Annotate? Domain Adaptation with a Constrained Budget

Recent work has demonstrated that pre-training in-domain language models can boost performance when adapting to a new domain. However, the costs associated with pre-training raise an important question: given a fixed budget, what steps should an NLP practitioner take to maximize performance? In this paper, we study domain adaptation under budget constraints, approaching it as a consumer choice problem between data annotation and pre-training. Specifically, we measure the annotation cost of three procedural text datasets and the pre-training cost of three in-domain language models. We then evaluate the utility of different combinations of pre-training and data annotation under varying budget constraints to assess which strategy works best. We find that, for small budgets, spending all funds on annotation leads to the best performance; once the budget becomes large enough, a combination of data annotation and in-domain pre-training works best. We therefore suggest that task-specific data annotation should be part of an economical strategy when adapting an NLP model to a new domain.


Introduction
[Figure 1: We view domain adaptation as a consumer choice problem (Becker, 1965; Lancaster, 1966). The NLP practitioner (consumer) is faced with the problem of choosing an optimal combination of annotation and pre-training under a constrained budget. This figure is purely for illustration and is not based on experimental data.]

The conventional wisdom on semi-supervised learning and unsupervised domain adaptation is that labeled data is expensive; therefore, training on a combination of labeled and unlabeled data is an economical approach to improve performance when adapting to a new domain (Blum and Mitchell, 1998; Daume III and Marcu, 2006; Hoffman et al., 2018). Recent work has shown that pre-training in-domain Transformers is an effective method for unsupervised adaptation (Han and Eisenstein, 2019; Wright and Augenstein, 2020) and even boosts performance
when large quantities of in-domain data are available (Gururangan et al., 2020). However, modern pre-training methods incur substantial costs (Izsak et al., 2021) and generate carbon emissions (Strubell et al., 2019; Schwartz et al., 2020; Bender et al., 2021). This raises an important question: given a fixed budget to improve a model's performance, what steps should an NLP practitioner take? On one hand, they could hire annotators to label in-domain task-specific data; on the other, they could buy or rent GPUs or TPUs to pre-train large in-domain language models. In this paper, we empirically study the best strategy for adapting to a new domain given a fixed budget. We view the NLP practitioner's dilemma of how to adapt to a new domain as a problem of consumer choice, a classical problem in microeconomics (Becker, 1965; Lancaster, 1966). As illustrated in Figure 1, the NLP practitioner (consumer) can obtain X_a annotated documents (by hiring annotators) at a cost of C_a each, and X_p hours of pre-training (by renting GPUs or TPUs) at a cost of C_p per hour. Given a fixed budget B, the consumer may choose any combination that fits within the budget constraint X_a C_a + X_p C_p ≤ B. The goal is to choose a combination that maximizes the utility function U(X_a, X_p), which can be defined using an appropriate performance metric, such as F1 score, achieved after pre-training for X_p hours and then fine-tuning on X_a in-domain documents.
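The consumer choice framing above can be sketched as a small search over feasible allocations. The utility function below is a made-up placeholder with diminishing returns; in the paper, U(X_a, X_p) is measured empirically (e.g., test set F1 after pre-training and fine-tuning).

```python
import math

# Toy sketch of the consumer choice problem: enumerate feasible combinations
# of X_a annotated documents (at C_a each) and X_p pre-training hours (at C_p
# per hour) under budget B, and pick the one maximizing a utility function.
def best_allocation(budget, c_a, c_p, utility):
    """Brute-force search over feasible (X_a, X_p) combinations."""
    best = None
    for x_a in range(int(budget // c_a) + 1):
        x_p = (budget - x_a * c_a) // c_p  # spend the remainder on TPU hours
        u = utility(x_a, x_p)
        if best is None or u > best[0]:
            best = (u, x_a, int(x_p))
    return best

# Placeholder utility with diminishing returns in both inputs; this is an
# assumption for illustration, not fitted to any experimental data.
def toy_utility(x_a, x_p):
    return math.log1p(x_a) + 0.5 * math.log1p(x_p)

u, x_a, x_p = best_allocation(1500, 0.6, 8.0, toy_utility)
print(x_a, x_p)
```

With a real utility function this search is of course infeasible (each evaluation requires training a model), which is why the paper estimates U empirically at a handful of budget points.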
To empirically estimate the cost of annotation, we hire annotators to label domain-specific documents for supervised fine-tuning in three procedural text domains: wet-lab protocols, paragraphs describing scientific procedures in PubMed articles, and chemical synthesis procedures described in patents. We choose to target natural language understanding for scientific procedures in this study, because there is an opportunity to help automate lab protocols and support more reproducible scientific experiments, yet few annotated datasets currently exist in these domains. Furthermore, annotation of scientific procedures is not easily amenable to crowdsourcing, making this an ideal testbed for pretraining-based domain adaptation. We measure the cost of in-domain pre-training on a large collection of unlabeled procedural texts using Google's Cloud TPUs. 2 Model performance is then evaluated under varying budget constraints in six source and target domain combinations.
Our analysis suggests that given current costs of pre-training large Transformer models, such as BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019), in-domain data annotation should always be part of an economical strategy when adapting a single NLP system to a new domain. For small budgets (e.g. less than $800 USD), spending all funds on annotation is the best policy; however, as more funding becomes available, a combination of pre-training and annotation is the best choice. This paper addresses a specific question that is often faced by NLP practitioners working on applications: what is the most economical approach to adapt an NLP system to a new domain when no pre-trained models or task-annotated datasets are initially available? If multiple NLP systems need to be adapted to a single target domain, model costs can be amortized, making pre-training an attractive option for smaller budgets.

Scope of the Study
In this study, we focus on a typical scenario faced by an NLP practitioner: adapting a single NLP system to a single new domain, maximizing performance within a constrained budget. We consider only the direct benefit on the target task in our main analysis (§5); however, in §6 we provide additional analysis of positive externalities on other related tasks that may benefit from a new pre-trained model.
We estimate cost based on two major expenses: annotating task-specific data (§3) and pre-training domain-specific models using TPUs (§4). Note that fine-tuning costs are not included in our analysis, as they are nearly equal whether the budget is invested in pre-training or annotation. 3 We assume a generic BERT is the closest zero-cost model that is initially available, which is likely the case in real-world domain adaptation scenarios (especially for non-English languages). Our experiments are designed to simulate a scenario where no domain-specific model is initially available. We also assume that the NLP engineer's salary is a fixed cost; in other words, their salary will be the same whether they spend time pre-training models or managing a group of annotators. 4 Our primary concerns are financial and environmental costs, rather than the overall time needed to obtain the adapted model. If the timeline is an important factor, the annotation process can be sped up by hiring more annotators.

Estimating Annotation Cost (C_a)
In this section, we present our estimates of the annotation cost for three procedural text datasets from specialized scientific domains, which enable a comparison of model performance under varying budget constraints ( §5.4).
Annotated Procedural Text Datasets. We experiment with three procedural text corpora: Wet Lab Protocols (WLP; Tabassum et al., 2020) and two new datasets we created for this study, covering scientific articles and chemical patents. Statistics of the three datasets are shown in Table 1. The WLP corpus includes 726 wet lab experiment instructions collected from protocols.io, which are annotated using an inventory of 20 entity types and 16 relation types. Following the same annotation scheme, we annotate PUBMEDM&M and CHEMSYN (see Table 1). More details on data pre-processing, annotation, and inter-annotator agreement scores can be found in Appendix A.

Annotation Cost. We recruit undergraduate students to annotate the datasets using the BRAT annotation tool. 5 Annotators are paid 13 USD / hour throughout the process, which is the standard rate for undergraduate students at our university. Estimates of the per-sentence cost of annotation, C_a, are presented in Table 1. 6

Estimating Pre-training Cost (C_p)

To evaluate varied strategies for combining pre-training and annotation given a fixed budget, we need accurate estimates of the cost of annotation, C_a, and pre-training, C_p. Having estimated the cost of annotating in-domain procedural text corpora in §3, we now turn to estimating the cost of in-domain pre-training. Specifically, we consider two popular approaches: 1) training an in-domain language model from scratch; 2) continued pre-training of an off-the-shelf model.

PROCEDURE Corpus Collection.
To pre-train our models, we create a novel collection of procedural texts from the same domains as the annotated data in §3, hereinafter referred to as the PROCEDURE corpus. Specially trained classifiers were used to identify paragraphs describing experimental procedures. For PubMed, we fine-tune SciBERT (Beltagy et al., 2019) on the SciSeg dataset (Dasigi et al., 2017), which is annotated with scientific discourse structure, and use the resulting classifier to extract procedures from the Materials and Methods sections of 680k articles. For the chemical synthesis domain, the chemical reaction extractor developed by Lowe (2012) was applied to the Description sections of 303k patents (174k U.S. and 129k European) we collected from USPTO 7 and EPO 8 . More details of our data collection process can be found in Appendix B.
Cooking recipes are also an important domain for research on procedural text understanding, so we include the text component of the Recipe1M+ dataset (Marín et al., 2021) in the PROCEDURE pre-training corpus. In total, our PROCEDURE collection contains around 1.1 billion words; more statistics are shown in Table 2. In addition, we create an extended version, PROCEDURE+, consisting of 12 billion words, where we up-sample the procedural paragraphs 6 times and combine them with the original full text of 680k PubMed articles and 303k chemical patents. This up-sampling ensures at least half of the text is procedural.
Pre-training Process and Cost. We train two procedural domain language models on the Google Cloud Platform using 8-core v3 TPUs: 1) ProcBERT, a BERT base model pre-trained from scratch using our PROCEDURE+ corpus, and 2) Proc-RoBERTa, for which we continued pretraining RoBERTa base on the PROCEDURE corpus following Gururangan et al. (2020).
We pre-train ProcBERT using the TensorFlow codebase of BERT. 9 Following Devlin et al. (2019), we deploy a two-step regime: the model is first trained with sequence length 128 and batch size 512 for 1 million steps at a rate of 4.71 steps/second. Then, it is trained for 100k more steps using sequences of length 512 and a batch size of 256 at a rate of 1.83 steps/second. The pre-training process takes about 74 hours, and the total cost is about 620 USD, which includes the price of on-demand TPU v3s (8 USD/hour) 10 plus auxiliary costs for virtual machines and data storage.
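As a sanity check, the reported wall-clock time and TPU cost follow directly from the step counts and throughputs above (rates and prices as quoted in the text; auxiliary VM and storage costs are excluded):

```python
# Recomputing ProcBERT's pre-training time and TPU cost from the reported
# schedule: 1M steps at 4.71 steps/s (length-128 stage), then 100k steps at
# 1.83 steps/s (length-512 stage), on an on-demand v3-8 TPU at 8 USD/hour.
stage1_seconds = 1_000_000 / 4.71
stage2_seconds = 100_000 / 1.83
total_hours = (stage1_seconds + stage2_seconds) / 3600
tpu_cost_usd = total_hours * 8
print(round(total_hours, 1), round(tpu_cost_usd))  # ~74 hours, ~593 USD of TPU time
```

The gap between the ~593 USD of raw TPU time and the reported 620 USD total is the auxiliary virtual machine and storage spend.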
We considered the possibility of evaluating checkpoints of partially pre-trained models, for fine-grained variation of the pre-training budget, however after some investigation we chose to only report results on fully pre-trained models, using established training protocols (learning rate, number of parameter updates, model size, sequence length, etc.) to ensure fair comparison.
In addition to pre-training from scratch, we also experiment with Domain-Adaptive Pre-training, using the codebase 11 released by AI2 to train Proc-RoBERTa. Similar to Gururangan et al. (2020), we fine-tune RoBERTa on our collected PROCEDURE corpus for 12.5k steps at an average speed of 27.27 seconds per step, which amounts to 95 hours of TPU time. 12 Thus, the total cost of Proc-RoBERTa is around 800 USD after including the auxiliary expenses.
Finally, we estimate the cost of training SciBERT (Beltagy et al., 2019), which was also trained on an 8-core TPU v3 using a two-stage training process similar to ProcBERT. The overall training of SciBERT took 7 days (5 days for the first stage and 2 days for the second stage), with an estimated cost of 1,340 USD.

9 https://github.com/google-research/bert
10 https://cloud.google.com/tpu/pricing
11 https://github.com/allenai/tpu_pretrain
12 This is comparable to the number reported by the authors of Gururangan et al. (2020) on GitHub.
Carbon Footprint. Apart from the financial cost, we also estimate the carbon footprint of each in-domain pre-trained language model to gauge its environmental impact. We measure the energy consumption in kilowatt-hours (KWh) as in Patterson et al. (2021):

KWh = H × N × P × PUE / 1000

where H is the number of training hours, N is the number of processors used, P is the average power per processor (in watts), 13 and PUE (Power Usage Effectiveness) indicates the energy usage efficiency of a data center. In our case, the average power per TPU v3 processor is 283 watts, and we use a PUE coefficient of 1.10, which is the average trailing twelve-month PUE reported for all Google data centers in Q1 2021. 14 Once we know the energy consumption, we can estimate the CO2 emissions (CO2e) as follows:

CO2e (kg) = KWh × (CO2e/KWh) / 1000

where CO2e/KWh measures the amount of CO2 emitted, in grams, when consuming 1 KWh of energy, which is 474 g/KWh for our pre-training. 15 For example, ProcBERT is pre-trained on a single 8-core TPU v3 for 74 hours, resulting in CO2 emissions of (74 × 8 × 283 × 1.10/1000) × 474/1000 = 87.4 kg. The estimated CO2 emissions for the three in-domain language models are shown in Table 3.
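The two formulas above combine into a small calculator; the numbers below are the ProcBERT figures quoted in the text.

```python
# Energy and emissions estimate following the Patterson et al. (2021)
# formulation: KWh = H * N * P * PUE / 1000, then
# CO2e (kg) = KWh * (grams of CO2 per KWh) / 1000.
def co2e_kg(hours, n_processors, watts_per_processor, pue, g_co2_per_kwh):
    kwh = hours * n_processors * watts_per_processor * pue / 1000
    return kwh * g_co2_per_kwh / 1000

# ProcBERT: 74 h on an 8-core TPU v3 (283 W per core on average), PUE 1.10,
# and 474 gCO2/KWh for the data center used.
print(round(co2e_kg(74, 8, 283, 1.10, 474), 1))  # 87.4 (kg)
```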
Varying Budget Constraints

Given the estimated unit costs of annotation C_a (§3) and pre-training C_p (§4), we now empirically evaluate the utility U(X_a, X_p) of various budgets and pre-training strategies to find an optimal policy for domain adaptation that fits within the budget constraint X_a C_a + X_p C_p ≤ B.

[Table 3: Carbon footprint of three in-domain pre-trained language models. CO2e is the number of metric tons of CO2 emissions with the same global warming potential as one metric ton of another greenhouse gas.]

NLP Tasks and Models
We experiment with two NLP tasks, Named Entity Recognition (NER) and Relation Extraction (RE). For NER, we follow Devlin et al. (2019) to feed the contextualized embedding of each token into a linear classification layer. For RE, we follow Zhong and Chen (2020), inserting four special tokens specifying positions and types of each entity-pair mention, which are included as input to a pre-trained sentence encoder. Gold entity mentions are used in our relation extraction experiments, to reduce variance due to entity recognition errors.
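As an illustration of the relation extraction input format, the sketch below inserts typed marker tokens around an entity pair. The marker spellings, entity types, and example sentence are illustrative, not the exact ones used by Zhong and Chen (2020).

```python
# Sketch of the RE input format: four special tokens mark the positions and
# types of the two entity mentions before the sequence is fed to the encoder.
def insert_entity_markers(tokens, subj_span, subj_type, obj_span, obj_type):
    """Spans are (start, end) token indices, end exclusive."""
    (s1, s2), (o1, o2) = subj_span, obj_span
    out = list(tokens)
    # Insert at later positions first so earlier indices remain valid.
    for pos, tok in sorted(
        [(s1, f"[SUBJ-{subj_type}]"), (s2, f"[/SUBJ-{subj_type}]"),
         (o1, f"[OBJ-{obj_type}]"), (o2, f"[/OBJ-{obj_type}]")],
        key=lambda x: -x[0],
    ):
        out.insert(pos, tok)
    return out

tokens = "Add 5 mL of ethanol to the flask".split()
marked = insert_entity_markers(tokens, (1, 3), "AMOUNT", (4, 5), "REAGENT")
print(" ".join(marked))
# Add [SUBJ-AMOUNT] 5 mL [/SUBJ-AMOUNT] of [OBJ-REAGENT] ethanol [/OBJ-REAGENT] to the flask
```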

Budget-constrained Experimental Setup
As we have three procedural text datasets (§3) annotated with entities and relations, we can experiment with six source ⇒ target adaptation settings. For each domain pair, we compare five different pre-trained language models when adapting to the procedural text domain under varying budgets. Based on the estimates of the annotation costs C_a (§3) and pre-training costs C_p (§4), we conduct various budget-constrained domain adaptation experiments. For example, if we have $1,500 and the PUBMEDM&M corpus, to build an NER model that works best for the CHEMSYN domain (PUBMEDM&M⇒CHEMSYN), we can spend all $1,500 to annotate 2,500 in-domain sentences to fine-tune off-the-shelf BERT. Alternatively, we could first spend $800 to pre-train Proc-RoBERTa, then fine-tune it on 1,155 sentences annotated in the CHEMSYN domain using the remaining $700. Under both budgeting strategies, an additional experiment is performed to choose one of two domain adaptation methods that maximizes performance: 1) a model that simply uses the annotated data in the target domain for fine-tuning; or 2) a model that is fine-tuned using a variant of EasyAdapt (Daumé III, 2007) to leverage annotated data in both the source and target domains (see below for details). We select the approach that has better development set performance and report its test set result in Tables 4 and 5 (see Appendix D for more details about hyper-parameters).
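The budget arithmetic behind this example can be made explicit. The per-sentence cost used below is backed out from the quoted figures and is approximate (the reported 1,155 sentences for the remaining $700 implies roughly 0.61 USD per sentence; we use 0.60 here for illustration).

```python
# How many target-domain sentences each strategy can afford: either the full
# budget goes to annotation, or a pre-training cost is paid first and the
# remainder goes to annotation. The 0.60 USD/sentence figure is an
# approximation backed out from the example in the text, not an exact cost.
def sentences_affordable(budget_usd, pretrain_cost_usd, usd_per_sentence):
    return int((budget_usd - pretrain_cost_usd) // usd_per_sentence)

budget = 1500
print(sentences_affordable(budget, 0, 0.60))    # 2500: all funds to annotation
print(sentences_affordable(budget, 800, 0.60))  # 1166: after Proc-RoBERTa's 800 USD
```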

EasyAdapt
In most of our experiments, we have access to a relatively large amount of labeled data from a source domain and varying amounts of data from the target domain. Instead of simply concatenating the source and target datasets for fine-tuning, we propose a simple yet novel variation of EasyAdapt (Daumé III, 2007) for pre-trained Transformers. More specifically, we create three copies of the model's contextualized representations: one represents the source domain, one represents the target, and the third is domain-independent. These contextualized vectors are then concatenated and fed into a linear layer that is 3 times as large as the base model's. When encoding data from a specific domain (e.g., CHEMSYN), the other domain's copy is zeroed out (so 1/3 of the new representation is always 0.0). This enables the domain-specific block of the linear layer to encode information specific to that domain, while the domain-independent parameters can learn to represent information that transfers across domains. This is similar to prior work applying EasyAdapt to LSTMs (Kim et al., 2016).
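A minimal sketch of this feature augmentation, for a single contextualized vector; the block order [source, target, shared] is an arbitrary choice for illustration.

```python
# EasyAdapt-style augmentation: triple the contextualized vector into
# [source block, target block, shared block], zeroing the block of the
# domain the example does NOT come from. A linear layer over this
# 3d-dimensional input then has domain-specific and shared parameter blocks.
def easyadapt_augment(h, domain):
    """h: list of floats (a contextualized vector); domain: 'source'/'target'."""
    zeros = [0.0] * len(h)
    if domain == "source":
        return h + zeros + h   # target block zeroed out
    if domain == "target":
        return zeros + h + h   # source block zeroed out
    raise ValueError(f"unknown domain: {domain}")

h = [0.5, -1.0, 2.0]
print(easyadapt_augment(h, "target"))
# [0.0, 0.0, 0.0, 0.5, -1.0, 2.0, 0.5, -1.0, 2.0]
```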

Experimental Results and Analysis
We present the test set NER and RE results with five annotation and pre-training combination strategies under six domain adaptation settings in Tables 4 and 5, respectively. We report averages across five random seeds with standard deviations as subscripts. If the pre-training cost exceeds the total budget, or the annotation budget exceeds the cost of all available data, we indicate the result as "NA". We now discuss a set of key questions regarding pre-training-based domain adaptation under a constrained budget, on which our experiments can shed some light.
Should we prioritize pre-training or data annotation for domain adaptation? For all six domain adaptation settings in Table 4, spending the entire budget on annotation and using the off-the-shelf language model BERT_large works best for NER when the budget is 700 USD, showing the effectiveness of data annotation in low-resource scenarios. As the budget increases, performance gains from labeling additional data diminish, and pre-trained in-domain language models take the lead. ProcBERT, which is pre-trained from scratch on the PROCEDURE+ corpus at a cost of only 620 USD, performs best at budgets of 1500 and 2300 USD. This demonstrates that combining domain-specific pre-training with data annotation is the best strategy in high-resource settings. Similarly for RE, as shown in Table 5, spending all funds on data annotation with off-the-shelf models achieves better performance at lower budgets, while domain-specific pre-training starts to excel as the budget increases past a certain point.
What is the starting budget at which pre-training an in-domain language model should be considered? To answer this question, we plot test set NER performance for two strategies, BERT_large (investing all funds in annotation) and ProcBERT (combining annotation with pre-training), against varying budgets in Figure 2. Specifically, the budget of each strategy starts at the pre-training cost of its associated language model, and is increased in 155 USD increments until the total budget reaches the total cost of available data. 17 We observe a similar trend for both PUBMEDM&M⇒CHEMSYN and CHEMSYN⇒WLP: annotation alone works better at lower budgets, while in-domain pre-training (ProcBERT) excels at higher budgets. However, the intersection of the curves for these two strategies occurs at different points, around 1085 USD and 775 USD respectively. We hypothesize two reasons for this difference: 1) each target domain may require a different amount of labeled data to generalize well; 2) the quantity of labeled data from the source domain may also affect the need for data annotation in the target domain. To test these hypotheses, we evaluate the utility of annotation vs. pre-training where no source-domain data is available in Figure 3. This setting is almost identical to that of Figure 2, except that models are trained on target domain labeled data only. Here, we observe that the intersection for CHEMSYN is still around 1085 USD, while the crossover point for WLP moves from the original 775 USD (in Figure 2) to around 1395 USD. Our hypothesis is that WLP is a broader domain than CHEMSYN (WLP covers a more diverse range of protocols, including cell culture, DNA sequencing, etc.), so it requires more annotated data to perform well under the setting of Figure 3. However, when adapting from a large source domain dataset like CHEMSYN, the need for annotated WLP data is reduced, so ProcBERT can outperform BERT_large at a lower budget.

[Table 4: Regardless of a small or large budget, prioritizing data annotation in the target domain is the most beneficial. † indicates results using EasyAdapt (§5.3), where source domain data helps.]

[Figure 2: Two strategies: 1) allocate all available funds to data annotation; 2) pre-train ProcBERT on in-domain data, then use the remaining budget for annotation. For small budgets, the former yields the best performance on NER, but as the budget increases, the latter becomes the best choice.]
Note that our estimated annotation cost, C_a (per sentence), includes the annotation of both entity mentions and relations, so our analysis amortizes the cost of pre-training across both tasks. In a scenario where more tasks need to be adapted for the target domain, this could be accounted for simply by dividing the cost of pre-training among tasks, which would shift the black curves in Figure 2 and Figure 3 to the left, making pre-training an economical choice at lower budgets. 18

[Table 6: Test set F1 on NER for entities seen and unseen in the training data for BERT_large and ProcBERT, when the two achieve very similar overall performance under the same budget constraints in Figure 3. ProcBERT performs better on the unseen entities.]

When using the same budget and achieving similar F1, how do pre-training and annotation differ? In the previous experiments, we showed that in-domain pre-training is an effective domain adaptation method, especially in high-budget settings, and that ProcBERT can work very well when trained with less labeled data. A plausible explanation is that in-domain pre-training improves generalization to new entities in the target domain, whereas additional annotation improves performance on entities that are observed in the training corpus. To evaluate this hypothesis, we compare model predictions of the two strategies at the crossover points in Figure 3, 19 and consider each entity in the test set as "Seen" or "Unseen" based on whether it was observed in the training set. We then calculate the F1 score for each category, as shown in Table 6. Although the compared models achieve nearly identical overall performance, the decomposition of performance on seen and unseen entities in Table 6 clearly suggests that in-domain pre-training leads to better generalization on unseen entities, whereas allocating more budget to annotation boosts performance on entities that were seen during training. This may help explain the main finding of this paper: in-domain pre-training results in better generalization on unseen mentions, leading to better marginal utility, but only after enough in-domain annotations are observed to fully cover the head of the distribution.

18 For more discussion on our assumptions, see §2.

[Table 7: Test set F1 on six procedural text datasets. The best task performance is boldfaced, and the second-best performance is underlined. For the SOTA model of each dataset, we refer readers to the corresponding paper for further details: Tamari et al. (2021) for XWLP, Wang et al. (2020) for CHEMU, Gupta and Durrett (2019) for RECIPE, Knafou et al. (2020) for NER on WLP, and Sohrab et al. (2020) for RE on WLP.]

Ancillary Procedural NLP Tasks
In addition to the procedural text datasets discussed in §5, we experiment with three ancillary procedural text corpora, to explore how in-domain pretraining can benefit other tasks.
The CHEMU corpus (Nguyen et al., 2020) contains NER and event annotations for 1500 chemical reaction snippets collected from 170 English patents. Its NER task focuses on identifying chemical compounds, and its event extraction (EE) task aims at detecting chemical reaction events including trigger detection and argument role labeling.
The XWLP corpus (Tamari et al., 2021) provides the Process Event Graphs (PEG) of 279 wet-lab biochemistry protocols. The PEG is a document-level graph-based representation specifying the involved experimental objects.
The RECIPE corpus (Kiddon et al., 2016) includes annotations of entity states for 866 cooking recipes. It supports an Entity Tracking (ET) task, which predicts whether or not a specific ingredient is involved in each step of a recipe.

Experiments on Ancillary Tasks
For CHEMU, gold arguments are provided, so we only need to identify the event trigger and predict the role of the gold arguments. An event prediction is correct if the event trigger, associated arguments, and their roles match the gold event mention. We tackle this task using a pipeline model similar to Zhong and Chen (2020). For XWLP, we focus on the operation argument role labeling task, where gold entities are provided as input. Following Tamari et al. (2021), we decompose the results into "Core" and "Non-Core" roles. For the RECIPE task, we follow the data splits and fine-tuning architecture of Gupta and Durrett (2019). The state of an ingredient in each cooking step is correct if it matches the gold label, as either present or absent.
Results. Test set results of eight pre-trained language models on six procedural text datasets are presented in Table 7. 20 ProcBERT performs best on most tasks, and even achieves state-of-the-art performance on operation argument role labeling ("Core" and "Non-Core") for XWLP, showing the effectiveness of in-domain pre-training.

Conclusion
In this paper, we address a number of questions related to the costs of adapting an NLP model to a new domain (Blitzer et al., 2006; Han and Eisenstein, 2019), an important and well-studied problem in NLP. We frame domain adaptation under a constrained budget as a problem of consumer choice. Experiments are conducted using several pre-trained models in three procedural text domains to determine when it is economical to pre-train in-domain Transformers (Gururangan et al., 2020), and when it is better to spend available resources on annotation. Our results suggest that when only a small number of NLP models need to be adapted to a new domain, pre-training, by itself, is not an economical solution.
tance with extracting experimental paragraphs from PubMed, and John Niekrasz for sharing the output of Daniel Lowe's reaction extraction tool on European patents. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0108, in addition to the NSF (IIS-1845670) and IARPA via the BETTER program. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense, IARPA, or the U.S. Government. This work is approved for Public Release, Distribution Unlimited.

A Data Annotation
We annotate two datasets, PUBMEDM&M and CHEMSYN, in the domains of scientific articles and chemical patents, mainly following the annotation scheme of the Wet Lab Protocols corpus (WLP; Tabassum et al., 2020). On top of the 20 entity types and 16 relation types in WLP, we add four entity types (COMPANY, SOFTWARE, DATA-COLLECTION and INFO-TYPE) and one relation type (BELONG-TO) due to two key features of our corpus: 1) scientific articles usually specify the provenance of reagents for better reproducibility; 2) our corpus covers a broader range of procedures, such as computer simulation and data analysis.
We recruit four undergraduate students to annotate the datasets using the BRAT annotation tool. 21 We double-annotate all files in PUBMEDM&M and half of the files in CHEMSYN. For the double-annotated files, a coordinator discussed the annotations with each annotator to make sure they followed the guidelines, and resolved any disagreements. For the inter-annotator agreement (IAA) score, we treat the annotation from one of the two annotators as the gold label and the other as the predicted label, then use the F1 scores of the Entity (Action) and Relation evaluations as the final inter-annotator agreement scores, shown in Table 8. CHEMSYN has higher IAA scores, for two potential reasons: 1) we annotated PUBMEDM&M first, so the annotators may have been more experienced by the time they annotated CHEMSYN; 2) PUBMEDM&M contains more diverse content, such as wet lab experiments and computer simulation procedures, while CHEMSYN is mainly about chemical synthesis.

B Procedural Corpus Collection
PubMed Articles. The first source of our procedural corpus is PubMed articles, because they contain a large number of freely accessible experimental procedures. Specifically, we extract procedural paragraphs from the Materials and Methods section of articles within the Open Access Subset of PubMed. XML files containing the full text of articles are downloaded from NCBI 22 and then processed to obtain all paragraphs within the Materials and Methods section.

21 https://github.com/nlplab/brat
To improve the quality of our collected corpus, we develop a procedural paragraph extractor by fine-tuning SciBERT (Beltagy et al., 2019) on the SciSeg dataset (Dasigi et al., 2017), which includes discourse labels ({Goal, Fact, Result, Hypothesis, Method, Problem, Implication}) for PubMed articles. This extractor achieves an average F1 score of 72.65% in five-fold cross validation, and we run it on all acquired paragraphs. We consider a paragraph a valid procedure if at least 40% of its clauses are labeled as Method. This threshold was chosen by manual inspection of a randomly sampled subset of the data.
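The 40% threshold rule can be sketched as follows; here `clause_labels` stands in for the per-clause output of the fine-tuned SciBERT discourse tagger.

```python
# A paragraph is kept as procedural if at least 40% of its clauses are
# labeled "Method" by the discourse classifier.
def is_procedural(clause_labels, threshold=0.40):
    if not clause_labels:
        return False
    method_frac = sum(lab == "Method" for lab in clause_labels) / len(clause_labels)
    return method_frac >= threshold

print(is_procedural(["Method", "Method", "Result", "Goal", "Method"]))  # True  (3/5)
print(is_procedural(["Fact", "Result", "Method"]))                      # False (1/3)
```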
In total, the PubMed Open Access Subset contains 2,542,736 articles, of which about 680k contain a Materials and Methods section. After running our trained procedural paragraph extractor, we retain a set of 1,785,923 procedural paragraphs. Based on a manual inspection of the extracted paragraphs, we estimate that 92% consist of instructions for carrying out experimental procedures. For the chemical patent paragraphs, we apply the language identification tool langid 25 to build an English-only corpus, which includes 3,671,482 paragraphs (90.9%).

C Pre-training Details
We pre-train ProcBERT using the TensorFlow codebase of BERT (Devlin et al., 2019). 26 We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999 and L2 weight decay of 0.01. Following Devlin et al. (2019), we deploy a two-step regime. In the first step, we pre-train the model with sequence length 128 and batch size 512 for 1 million steps. The learning rate is warmed up over the first 100k steps to a peak value of 1e-4, then linearly decayed. In the second step, we train for 100k more steps with sequence length 512 and batch size 256 to learn the positional embeddings, with a peak learning rate of 2e-5. We use the original sub-word masking strategy, masking 15% of tokens in the sequence during both training steps.
For Proc-RoBERTa, we use the codebase from AI2, 27 which enables language model pre-training on TPUs with PyTorch. Similar to Gururangan et al. (2020), we train RoBERTa on our collected procedural text corpus for 12.5k steps with a learning rate of 3e-5 and an effective batch size of 2048, achieved by accumulating gradients over 128 steps with a base batch size of 16. The input sequence length is 512 throughout, and 15% of words are masked for prediction.

D Hyper-parameters for Downstream Tasks