SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
1 Allen Institute for Artificial Intelligence, Seattle, WA, USA  2 Northwestern University, IL, USA  3 Yale University, CT, USA. Correspondence to: Amanpreet Singh <amanpreets@allenai.org>.

Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking and search. We then use this benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters and find they outperform the existing single-embedding state-of-the-art by over 2 points absolute. We release the resulting family of multi-format models, called SPECTER2, for the community to use and build on.


Introduction
Learning representations of documents is critical for a variety of NLP tasks including classification, search, and recommendation (Cohan et al., 2020). Recent work has shown how pretrained language models (e.g., Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020) can be tailored to produce high-quality representations of documents with contrastive learning (Xu et al., 2021; Gao et al., 2021; Neelakantan et al., 2022). In the scientific domain, training objectives based on contrastive learning of cross-document links (e.g., citations) have shown further improvements in document-level representation learning (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022). These methods are especially useful because the representations they produce can be indexed and later efficiently consumed by lightweight downstream models without additional fine-tuning.
While there has been significant progress in evaluating the generalizability of NLP models (Ye et al., 2021; Sanh et al., 2021), evaluation of scientific document representations has remained limited. Existing benchmarks either focus on document similarity (Mysore et al., 2021; Voorhees et al., 2021) or on tasks that are highly correlated and not diverse (Cohan et al., 2020). Further, as we show in our experiments, models that work well on general-purpose text embedding benchmarks such as the recent MTEB (Muennighoff et al., 2022) may not perform well on scientific tasks.
We introduce SciRepEval, the first benchmark for comprehensive evaluation of document-representation models in the scientific domain. Unlike prior work, SciRepEval is large and includes a collection of highly diverse tasks, thus encouraging research on generalization (instance-level, cross-task and cross-domain). It consists of 25 realistic tasks that reflect practical use cases of scientific document representations across four formats: text classification, regression, proximity-based ranking (e.g., nearest-neighbor), and ad-hoc search. Eleven of these are new contributions. SciRepEval contains standard sets of both training and evaluation datasets to simplify and standardize comparisons between methods evaluated on the benchmark.
We then use this new benchmark to investigate and improve the generalization ability of document representation models. Following recent work (Cohan et al., 2020; Ostendorff et al., 2022b; Mysore et al., 2022), we further pre-fine-tune a transformer model (SciNCL; Ostendorff et al., 2022b) to produce high-quality representations for downstream tasks. We hypothesize that condensing all relevant information of a document into a single vector representation might not be expressive enough for generalization across a wide range of tasks. Prior work addresses a similar challenge in the context of document similarity by learning multiple representations, each associated with a different aspect of a paper (e.g., task, method, results) (Mysore et al., 2022; Ostendorff et al., 2022a). In contrast, we aim to learn effective representations for multiple downstream task formats.
Following recent successes in multi-task learning in NLP (Ye et al., 2021; Sanh et al., 2021), we explore large-scale multi-task training in the context of scientific document representations, applying a suitable optimization objective to each task format in SciRepEval: cross-entropy loss for classification, triplet loss for proximity and ad-hoc search, and mean squared error loss for regression. We explore two state-of-the-art techniques for generating format-specific document representations: control codes (Keskar et al., 2019; Raffel et al., 2020) prepended to the input to indicate the format, and parameter-efficient adapter methods (Houlsby et al., 2019; Pfeiffer et al., 2021; Stickland & Murray, 2019), in which a separate network module is introduced for every task format.
Our experiments investigate: (i) whether existing document representation methods can generalize to a highly diverse set of tasks, (ii) whether multi-task training on diverse data can improve document representation models, and (iii) whether task-format-specific representations can improve generalization. Through extensive analysis we find that existing state-of-the-art scientific document representation models such as SPECTER (Cohan et al., 2020) and SciNCL (Ostendorff et al., 2022b) struggle to generalize to all task types. Interestingly, we find that simple multi-task training on a large set of tasks does not significantly improve the results. However, multiple task-format-specific representations can substantially improve generalization.
To summarize, our contributions are: (i) SciRepEval, a comprehensive benchmark of 25 highly diverse and realistic tasks across four task formats, with standard sets for both training and evaluation. (ii) An extensive investigation of the generalizability of state-of-the-art scientific document representation models.
(iii) A set of new multi-task document representation models that, unlike existing methods, can produce representations tailored to different task formats. The new methods show improved generalization, outperforming prior work by up to 1.5 points absolute.
We release the benchmark and associated code for training and evaluation to encourage further research in this area: https://anonymous.4open.science/r/scirepeval-6FDC/.

Background
Representing Scientific Documents Earlier work on document embeddings used word vectors (J et al., 2016; Le & Mikolov, 2014; Wu et al., 2018), convolutions (Liu et al., 2017; Zamani et al., 2018), bi-encoder networks (Conneau et al., 2017) and BERT-based methods (Reimers & Gurevych, 2019). Recent works have produced large-scale language models pre-trained on scientific corpora (Beltagy et al., 2019; Yasunaga et al., 2022; Trewartha et al., 2022). These tend to perform better than general-purpose models on scientific-domain tasks, and serve as a foundation for learning dense embeddings of scientific documents. Cohan et al. (2020) and Ostendorff et al. (2022b) fine-tune SciBERT (Beltagy et al., 2019) with a triplet loss that encourages papers citing each other to have similar embeddings, using the title and abstract of research papers as the input.
Both Cohan et al. (2020) and Ostendorff et al. (2022b) are evaluated on SciDocs. However, as we discuss further in section 3 and Appendix F, this benchmark has important limitations. In contrast, SciRepEval provides a more challenging and diverse set of tasks, for both training and evaluation, to help motivate methods for producing scientific document representations that do well across multiple task formats. We attempt to learn task-specific embeddings of the documents by pre-fine-tuning on multiple objectives simultaneously. Prior work has proposed techniques for learning multiple embeddings per paper (Ostendorff et al., 2022a; Mysore et al., 2022). These methods are, however, orthogonal to ours: they generate an embedding per paper "facet", while we focus on learning separate embeddings per task format. In addition, these techniques focus only on finer-grained paper similarity, while our aim is to produce general embeddings applicable to a variety of task formats.
Multi-Task Learning Across Formats Multi-task learning (Caruana, 1993) with deep neural networks has been shown to improve performance over single-task training for related objectives (Liu et al., 2015; 2019b). Though unrelated tasks can lead to negative transfer, recent work has shown that simply increasing the number of tasks tends to yield better performance in multi-task learning (Aghajanyan et al., 2021; Aribandi et al., 2022; Padmakumar et al., 2022). Aghajanyan et al. (2021) pre-fine-tune pre-trained language models simultaneously on 46 tasks across 4 task types before fine-tuning on the downstream task. Aribandi et al. (2022) pre-train T5 (Raffel et al., 2020) on a combination of C4 span denoising and 107 other tasks across 8 task families. Ye et al. (2021) introduce an ontology of 160 tasks for few-shot multi-task training. Unlike these task families, which are divided primarily by semantics (e.g., classifying sentiment vs. classifying entailment), the training tasks in SciRepEval consist of 8 large-scale scientific datasets across the four task formats. Since our goal is to evaluate final document representations, rather than fine-tuning on individual downstream tasks like the above approaches, we follow SPECTER (Cohan et al., 2020) and directly apply the representations as features to the tasks.
Adapters for Multiple Tasks Adapters were introduced by Houlsby et al. (2019) for parameter-efficient training of transformers (Vaswani et al., 2017). A small number of trainable parameters are added to each layer, while the base encoder is kept frozen. This is similar to ELMo (Peters et al., 2018), which learned task-specific weightings for the biLSTM layers.

SciRepEval
We introduce SciRepEval, a benchmark suite of 25 tasks across four formats for training and evaluating multi-task embeddings of scholarly papers. SciRepEval aims to enable comprehensive evaluation of paper embeddings by providing (1) a highly diverse set of tasks, spanning multiple task formats (classification, regression, proximity and ad-hoc search), to challenge the general-purpose applicability of embeddings, (2) realistic tasks that reflect practical use cases of paper embeddings, and (3) a standard set of both training and evaluation datasets to simplify comparisons between methods evaluated on the benchmark.
The previous scholarly paper embedding benchmark is SciDocs (Cohan et al., 2020), which includes two classification tasks, four nearest-neighbors tasks, and one recommendation task. SciRepEval includes SciDocs as a subset, but addresses several key limitations: (i) The four nearest-neighbor tasks in SciDocs are built to distinguish a related document from random negatives given a query document, which might be too easy and not representative of real tasks in scholarly information retrieval. SciRepEval has more realistic tasks such as search, author disambiguation, and paper-reviewer matching, among others.
(ii) For the methods evaluated in section 5, we found that the SciDocs recommendations task was noisy and had limited power to distinguish different embeddings. The test set includes only 1000 clickthrough events, and the use of propensity weighting means that even fewer examples dominate test performance. While SciRepEval includes SciDocs as a subset, we exclude the recommendations task.
(iii) The tasks in SciDocs were constructed to be used only for evaluation, and have few enough samples that training on SciDocs is impractical (see Table 1). In SciRepEval, eight of the largest tasks across the four formats are used for training, while the remaining out-of-train tasks are reserved for evaluation. This enables the study of multi-task approaches, rather than relying solely on the citation signal. The training data in SciRepEval also has large-scale representation across multiple domains, as discussed in Appendix D.
(iv) Four of the tasks in SciDocs have very high model-performance correlations between them (greater than 0.99), indicating that the diversity of the tasks is limited. See Appendix F for more details.
The tasks in SciRepEval are summarized in Table 1. They are a mixture of existing and new datasets. Datasets with at least 100,000 instances (triplets for proximity/ad-hoc search) are in-train datasets used for training, while the others are out-of-train, used only for evaluation. Although SciDocs tasks are used as out-of-training evaluation tasks, we report their performance in a separate category.
Next, we briefly describe each of the task formats and their component tasks. Full details are provided in Appendix A.
Except for Search, all the tasks use paper embeddings created from a combination of paper title and abstract as the input. Search requires additional metadata (subsection 4.1), which is concatenated to the title and abstract.
Ad-Hoc Search In ad-hoc search tasks, we are given a short textual query and the task is to rank a set of candidate papers by relatedness to the query. Ad-hoc search is a critical mechanism for paper discovery in practice, and we gather multiple real-world datasets for training and evaluation. One of the evaluation datasets comes from previous work: TREC-CoVID (Voorhees et al., 2021), a biomedical challenge task that ranks papers from CORD-19 (Wang et al., 2020b) in response to textual search queries. Two new datasets are introduced in our work: 'feeds' data taken from a scholarly paper recommendation system, where we treat the user-specified feed name as the topic query and aim to rank the papers the user has annotated as relevant above those annotated as irrelevant, and a large training dataset (described in subsection A.1).

Table 1: Summary of SciRepEval tasks across the four formats - classification (CLF), regression (RGN), proximity (PRX) and ad-hoc search (SRCH). The models in section 6 are first trained on the in-train tasks and then benchmarked on their held-out sets as well as the 17 test tasks. Information retrieval tasks have Q queries with P candidate pairs and the S2AND task has X clusters with Y author-paper pairs. S: Silver, G: Gold. SciDocs is evaluated as per Cohan et al. (2020).

Proximity Similar to ad-hoc search, proximity tasks involve ranking a set of candidate papers by their relatedness to a query, except that the query in this case is itself a paper.
Proximity-based tasks form a basis for paper-based retrieval and recommendation, and for estimating paper similarity for use in applications like author disambiguation. We include a total of eleven proximity-based tasks, including four evaluation tasks from SciDocs (predicting citations and co-citations, and predicting co-viewed or co-read papers), and two others from previous work: the S2AND author disambiguation task (Subramanian et al., 2021) with paper similarity features, and Paper-Reviewer Matching, where candidate reviewers are ranked by expert annotators based on the similarity of their papers to the query paper to be reviewed. The Paper-Reviewer Matching task combines three existing datasets (Mimno & McCallum, 2007; Liu et al., 2014; Zhao et al., 2022), which we describe in more detail in subsection A.2. We also introduce five new proximity tasks, including two feeds evaluation tasks from the recommender discussed above, where one or multiple relevant papers serve as queries. For training, we include three large-scale datasets aimed at predicting same-authors, citations (via triplets) as in Cohan et al. (2020), and influential citations, which we define as four or more citations of the same paper in the text of a single paper.
For evaluating embeddings in proximity tasks, we rank candidates by Euclidean embedding distance, using MAP as the evaluation metric except for S2AND, which uses B³ F1 (Bagga & Baldwin, 1998), and Paper-Reviewer Matching, which uses precision@5 and precision@10.
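As a concrete sketch of this evaluation protocol, the snippet below ranks a single query's candidates by Euclidean distance between embeddings and computes average precision for that query (MAP is the mean of this quantity over all queries). The toy 2-D vectors and relevance labels are illustrative stand-ins, not benchmark data.

```python
import math

def rank_by_distance(query_emb, cand_embs):
    """Indices of candidates sorted by Euclidean distance to the query (closest first)."""
    dists = [math.dist(query_emb, c) for c in cand_embs]
    return sorted(range(len(cand_embs)), key=dists.__getitem__)

def average_precision(ranked_relevance):
    """AP for one query, given binary relevance labels in ranked order."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / max(hits, 1)

# Toy 2-D "embeddings": candidates 0 and 2 are the relevant ones.
query = [0.0, 0.0]
candidates = [[0.1, 0.0], [1.0, 1.0], [0.0, 0.2]]
relevance = [1, 0, 1]
order = rank_by_distance(query, candidates)            # → [0, 2, 1]
ap = average_precision([relevance[i] for i in order])  # → 1.0
```

Here both relevant candidates are ranked above the irrelevant one, so the per-query AP is 1.0.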
Classification Classifying papers into topical categories is a foundational task for document organization and discovery. Apart from the two SciDocs tasks (MAG and MeSH Diseases), we have four others: a binary task to predict whether a paper is relevant to biomimicry (Shyam et al., 2019), two biomedical classification tasks, namely DRSM from Burns (2022) and MeSH Descriptors classification (Lipscomb, 2000), and a new large-scale field-of-study (FoS) multi-label training set of more than 500K papers with silver FoS labels based on publication venue.
We evaluate embeddings on classification by scoring their performance as features within linear support vector classifiers. Results for these tasks are evaluated using F1 score (binary- or macro-F1 depending on the dataset, as indicated in Table 1). To better understand how embeddings perform in data-scarce regimes, we also construct two few-shot versions each of the out-of-train classification datasets and of the FoS subset for which we have manually annotated gold labels.
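A minimal sketch of this protocol, assuming scikit-learn is available and using tiny synthetic 2-D vectors in place of real paper embeddings: the frozen embeddings serve as input features for a linear SVC, and only the lightweight classifier is fit.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Synthetic stand-ins for paper embeddings: two well-separated classes.
X_train = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y_train = [0, 0, 1, 1]
X_test = [[0.05, 0.95], [0.95, 0.05]]
y_test = [0, 1]

# The embeddings are frozen; only this lightweight classifier is trained.
clf = LinearSVC(C=1.0).fit(X_train, y_train)
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")  # → 1.0
```

On real SciRepEval tasks the features would be encoder outputs rather than hand-picked points, and the metric would be binary- or macro-F1 as per Table 1.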
Regression We also consider a set of regression tasks where the goal is to predict a continuous quantity for a given paper. For evaluation, we consider predicting three numeric attributes related to prominence or quality: Tweet Mentions (Jain & Singh, 2021), and two new datasets predicting the peer review rating and the maximum h-index of authors for a collection of ICLR papers obtained from OpenReview. For training, we introduce two additional datasets: predicting the citation count and year of publication of papers.
We evaluate embeddings on regression by scoring their performance as features within linear support vector regression models. The reported results are computed as Kendall's τ rank correlation between the true and predicted labels.
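For reference, Kendall's τ compares every pair of items and counts concordant versus discordant orderings between the true and predicted labels. Below is a minimal tau-a implementation as a sketch; production evaluations may use a tie-corrected variant such as tau-b, which this sketch ignores.

```python
def kendall_tau(y_true, y_pred):
    """Tau-a: (concordant - discordant) / total pairs, ignoring ties."""
    n = len(y_true)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# One of the six pairs is ordered incorrectly → (5 - 1) / 6 ≈ 0.667
tau = kendall_tau([1, 2, 3, 4], [10, 30, 20, 40])
```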

Multi-format representation learning
Typical approaches for learning document embeddings produce a single embedding for every task (Cohan et al., 2020; Ostendorff et al., 2022b). We hypothesize that a single embedding will be insufficient for generalizing across a diversity of downstream tasks when the embeddings are used as features in lightweight classifiers. At the other extreme, learning embeddings for each task separately does not allow generalization to new tasks, and also incurs significant storage costs that scale with the number of tasks. We propose a method for learning a distinct document embedding for each task format, using a multi-task learning framework.
We assume we are given labeled data from a set of tasks for our four formats (ad-hoc search, proximity, classification, and regression), and we learn models capable of producing an embedding for any given (paper, format) pair. Here, papers are represented in terms of their title and abstract.
Our goal is for these embeddings to be used in lightweight classifiers/regressors as well as in nearest neighbor tasks, which we evaluate both on held-out data from the training tasks, and on new held-out tasks.
To help build intuition for why different embedding sets for different task formats may be helpful, Figure 1 illustrates the qualitative distinctions between the task formats. In general, an embedding space performing well for one task format may be less suited to the others; for example, the classification space provides an error-free linear classifier, but its nearest-neighbor pairs are not always of the same class. Empirically, we find that learning specialized embeddings per format improves performance, and that embeddings trained on a format tend to perform better on held-out tasks with the same format (see Table 4). Further, partitioning randomly (as discussed in Table 7) was less effective than the format-based partitioning. Nonetheless, format-based partitioning is just one choice of many, and experimenting with other partitioning schemes is an important item of future work.

Model
We follow Cohan et al. (2020) in using a pretrained transformer encoder as our base model. A scientific document is given as input to the encoder as a concatenation of its title and abstract, separated by the [SEP] token. Unlike Cohan et al. (2020), we use three different types of training objectives, suitable for each format, to train the model as described in subsection 4.2. We explore two methods to learn separate embeddings for each task format: control codes and adapters, as shown in Figure 1.

Control Codes
In the control code approach, we prepend a special per-format token (see Table 7 in the appendix) to the input and pass it to the transformer model, taking the final layer representation corresponding to this token as the document embedding and feeding it as input to the task-specific head (described in subsection 4.2).
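A sketch of the input construction, with placeholder token strings (the paper's actual control-code tokens are listed in its appendix Table 7; the names below are assumptions for illustration):

```python
# Placeholder control-code tokens, one per task format.
CONTROL_CODES = {
    "classification": "[CLF]",
    "regression": "[RGN]",
    "proximity": "[PRX]",
    "search": "[SRCH]",
}

def build_input(title, abstract, task_format):
    """Prepend the format's control code to 'title [SEP] abstract'."""
    return f"{CONTROL_CODES[task_format]} {title} [SEP] {abstract}"

text = build_input("SciRepEval", "A multi-format benchmark...", "proximity")
# → "[PRX] SciRepEval [SEP] A multi-format benchmark..."
```

The encoder's final-layer hidden state at the control-code token's position is then read off as the document embedding for that format.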
Adapters

We also experiment with adapters, which have been shown to be effective for multi-task learning. In particular, we explore the Adapter Fusion (Pfeiffer et al., 2021) and PALs (Stickland & Murray, 2019) methods, each of which introduces task-specific adapters and attention modules at every transformer layer. Since our goal is to learn different embeddings for different task formats, we create modules for each task format rather than each task, and the final hidden representation of the [CLS] token output via the adapter is taken as the corresponding embedding of the document.

Training
We train the model in a multi-task setup with task-heterogeneous batching (Aghajanyan et al., 2021). For classification and regression, we use a linear head atop the base transformer encoder. We train on both multi-class and multi-label tasks, using cross-entropy loss for the former and binary cross-entropy (BCE) with sigmoid activation for the latter. For regression we minimize the mean squared error (MSE) loss.
For proximity and ad-hoc search tasks we use the triplet loss as in Cohan et al. (2020). For these task formats, given a query, a relevance score accompanies each candidate. The query can be a document (for which we wish to find similar documents) or a raw textual query. Each training instance in this setup is a triplet consisting of a paper or plain-text query Q, a positive candidate paper P+ and a negative candidate P−, where P+ has a higher score than P−. Then, we optimize the triplet loss:

$$\mathcal{L}_{\text{triplet}} = \max\left\{ d(Q_E, P^+_E) - d(Q_E, P^-_E) + \epsilon,\; 0 \right\}$$

where d is the Euclidean distance used as a measure of similarity between the query embedding Q_E and the candidate embeddings P^+_E and P^-_E, and ϵ is the margin hyperparameter, whose value of 1 was chosen based on preliminary experiments.
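The loss above can be sketched directly in a few lines; d is the Euclidean distance and the margin defaults to 1 as in the paper, with toy 2-D vectors standing in for real embeddings.

```python
import math

def triplet_loss(q_emb, pos_emb, neg_emb, margin=1.0):
    """max(d(Q, P+) - d(Q, P-) + margin, 0) with Euclidean distance d."""
    return max(math.dist(q_emb, pos_emb) - math.dist(q_emb, neg_emb) + margin, 0.0)

# Positive candidate closer than the negative by more than the margin → zero loss.
zero_loss = triplet_loss([0.0, 0.0], [0.5, 0.0], [2.0, 0.0])     # → 0.0
# Negative closer than the positive → positive loss: (2.0 - 0.5) + 1.0 = 2.5.
nonzero_loss = triplet_loss([0.0, 0.0], [2.0, 0.0], [0.5, 0.0])  # → 2.5
```

In training, the gradient of this loss pulls the positive candidate's embedding toward the query and pushes the negative's away until the margin is satisfied.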

Experiment Setup
Training Data We train our multi-format models on the 8 large in-train tasks detailed in Table 1. For proximity and ad-hoc search, we create up to 5 examples for each query by sampling positive and negative papers from its candidate pool. We limit the number of training samples from each task to at most 600K, resulting in training and validation sets of 3.27M and 446K instances in total, respectively.
Transformer Baselines As a first step, we evaluate existing document representation methods on our benchmark. These include SciBERT (Beltagy et al., 2019), a language model pre-trained on scientific corpora, and paper-embedding methods including SPECTER (Cohan et al., 2020), ASPIRE (Mysore et al., 2022), and SciNCL (Ostendorff et al., 2022b), which is the state of the art on SciDocs. ASPIRE produces representations for aspect-based matching between query and candidate papers, which is a setting similar to our proximity tasks, so we only evaluate it on that task subset and report the results in Appendix C. Finally, we also evaluate two general-purpose text embedding methods, E5 (Wang et al., 2022) and MPNet (Song et al., 2020). These methods are pre-trained on a set of over 1B query-candidate pairs, which includes scientific text, and are two of the best-performing BERT-base-sized models on the recent MTEB benchmark (Muennighoff et al., 2022) for general-purpose text embeddings.
Next, for our multi-format baselines, we initialize with SciNCL and further train it in a multi-task setup on the in-train tasks, both with (MTL CTRL) and without the control codes (MTL CLS). Finally, to compare the control-codes-based approach with the adapter techniques, we train the BERT PALs and Fusion architectures, keeping SciNCL as the base model in both. Fusion is a two-step process: it first introduces task-format-specific adapters (Adapters) and then the fusion modules (Adapter Fusion). The MTL CTRL and adapter approaches produce multiple representations per document, while MTL CLS produces a single representation, similar to existing methods. We use the PyTorch implementations of the models by HuggingFace. The specific training configurations are described in Appendix B.

Results
Table 2 shows the evaluation of all our transformer baselines, producing both single and multiple representations per document, on SciRepEval. Our benchmark includes diverse tasks with a variety of different evaluation metrics, and following previous work [e.g., Wang et al. (2019)] we report an average of the individual metrics (each ranging from 0-100). Among the vanilla models, even though E5 and MPNet perform better than SciBERT, SPECTER and SciNCL still outperform them, suggesting the need for domain-specific embeddings to do well on SciRepEval. The pre-fine-tuned multi-format variants outperform the vanilla models on average. We also find that all the approaches that produce multiple representation types outperform the MTL CLS model, which learns only a single representation shared for all tasks, by up to 1.5 points. The adapter variants are better than MTL CTRL overall, and result in an improvement of 0.6-1.3 points on the out-of-train tasks, with task-format-specific adapters performing the best.
Table 2: Evaluation results on SciRepEval in multiple settings. MTL CLS generates a single embedding for all tasks; MTL CTRL (control codes) and the adapter variants (Adapters, PALs, and Adapter Fusion) produce an embedding per task format. We consider an ensemble approach that averages the MTL CTRL and Adapter embeddings. For models we trained, we report the mean and standard deviation (in parentheses) across 5 runs with different seeds. The best results are highlighted in bold. We conduct a one-way analysis of variance (ANOVA) with Tukey's test (Haynes, 2013) for α = 0.05 across multiple settings and underline those not statistically significantly different from the best.

Further, as shown in Table 5, the control codes and adapters are the most efficient in terms of model size and computation runtime. Hence, we try to improve upon each by combining representations from the Adapter and the MTL CTRL models by averaging them, and we find that these combined embeddings outperform the individual ones consistently across the in-train, out-of-train, and SciDocs settings. All the models except SciBERT (not pre-trained with a citation objective) do well on SciDocs, with vanilla SciNCL being the best. ASPIRE, as reported in Appendix C, performs well on SciDocs but not on other similar tasks in SciRepEval.
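The ensemble itself is simply an element-wise average of the two models' embeddings for the same document and task format, which can be sketched as:

```python
def ensemble_embedding(ctrl_emb, adapter_emb):
    """Element-wise mean of the MTL CTRL and Adapter embeddings of one document."""
    assert len(ctrl_emb) == len(adapter_emb)
    return [(a + b) / 2 for a, b in zip(ctrl_emb, adapter_emb)]

combined = ensemble_embedding([1.0, 3.0], [3.0, 1.0])  # → [2.0, 2.0]
```

Averaging keeps the embedding dimensionality unchanged, so downstream classifiers and indexes need no modification.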
Alternative Base Models To confirm that our findings hold across multiple base models, we compare MTL CLS, MTL CTRL and adapters with SPECTER and SciBERT as the base models. Table 3 shows that the MTL CTRL token and adapter approaches still substantially outperform the MTL CLS approach, suggesting that the efficacy of using an embedding per task format instead of a single embedding per document holds across a range of base model types.

Analyses
Specialization of Control Code Embeddings Our hypothesis is that by training embedding spaces on particular task formats, they will become more accurate for tasks of that format than for others. We test this hypothesis by sampling, for ease of computation, one in-train and one out-of-train task of every format (in-train: FoS, Citation Count, Same Author Detection, Search; out-of-train: DRSM, Peer Review Score, Paper-Reviewer Matching, TREC-CoVID) and applying all the control codes to them for evaluation. As shown in Table 4, the control codes trained on a task format perform best for tasks of that format, both in-train and out-of-train. (For the ensemble discussed earlier, we also tried concatenating the embeddings in initial experiments, which yielded similar results but doubled the embedding size.)
As an extension of this experiment, we also analyze how well the control code representations work when the encoder is trained on tasks which are randomly grouped together, as opposed to grouped by task format. We take the mean evaluation metrics produced from 5 random-partition runs. On the out-of-train tasks, the corresponding control codes for classification, regression, proximity and ad-hoc search show gains of +0.2, +3.9, +4.5 and +2.2 points respectively over random partitioning. Similarly, for in-train tasks the control codes are better by +5.2, +3.8, +1.2 and +1.3 points respectively. The results suggest that representations specific to each task format do lead to better results overall.
Finally, to study training affinity among the task formats, we pre-fine-tune on a maximum of two formats at once. Appendix G reveals that combined multi-task training on similar task formats, like regression/classification and proximity/ad-hoc search, results in performance gains, but only on related tasks. Training on all the tasks yields better results on average across the task formats.
Efficiency While the variants producing representations based on task format serve as strong baselines on the SciRepEval benchmark, as shown in Table 2, efficiency is another important consideration in practice. As shown in Table 5, the control code approach only requires one new control code embedding per format, and has no impact on training time. PALs, in contrast, introduce new attention layers and train the entire network, increasing runtime.

Conclusion
We introduce SciRepEval, a benchmark for scientific document representation methods with 25 tasks across four task formats. On this benchmark, we show that learning a separate document representation for each task format substantially improves task performance compared to learning a single representation for all tasks. Future work could address limitations of our work by evaluating partitioning schemes beyond task format, crafting higher-fidelity metrics to account for the diversity of tasks in SciRepEval (which vary in sensitivity and in relevance to downstream applications), or further exploring how accuracy varies with computational and storage cost.
Acknowledgments

We would like to acknowledge the support of the NASA AATT Project for funding the PeTaL research and contributing the biomimicry dataset. This work was supported in part by NSF Grant 2033558.

A.4. Regression
Citation Count We sample a collection of scientific articles published in 2016 from the set of papers in the search dataset described in subsection A.1, so that a 5-year period has passed for them to accumulate citations. Each article has at least one citation, and the citation counts are converted to log scale.

Year of Publication
The aim of this task is to determine research trends by predicting the year of publication of a scientific article. We sample publications from the search dataset with a publication date after the year 2005, and scale the years so that their values are between 0 and 1. Further, since this task is used for training along with citation count prediction, the labels are additionally multiplied by the mean of the citation-count labels to align the loss scales of the two tasks.
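A sketch of the described label preparation, assuming simple min-max scaling of the years followed by multiplication by the mean citation-count label (the exact scaling details beyond this are an assumption):

```python
def scale_year_labels(years, citation_label_mean):
    """Min-max scale years to [0, 1], then multiply by the mean citation-count
    label so the two regression tasks train with comparable loss magnitudes."""
    lo, hi = min(years), max(years)
    unit = [(y - lo) / (hi - lo) for y in years]
    return [u * citation_label_mean for u in unit]

labels = scale_year_labels([2006, 2011, 2016], citation_label_mean=2.0)
# → [0.0, 1.0, 2.0]
```

Without this rescaling, the [0, 1] year labels would contribute a much smaller MSE than the log-scaled citation counts when both tasks share a batch.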
Peer Review Score We use the OpenReview API to collect paper metadata and corresponding review scores for ICLR conferences from 2017 to 2022. Each reviewer in ICLR assigns a final rating in the range [0-10], and we take the mean rating as the label for every paper.
h-Index of Authors In this task the goal is to predict the maximum h-index of any of the authors of a scientific publication. We re-use the peer review score dataset, obtain the h-index of all the authors for each paper using the Semantic Scholar API, and pick the maximum as the label. The labels are normalized to lie in [0, 1].

Tweet Mentions
The goal of this task is to predict the combined number of a paper's mentions and retweets. We post-process the dataset created by Jain & Singh (2021), which contains tweets about arXiv papers from 2010 to 2019. The sum of the normalized mention and retweet counts is the score to be predicted.

B. Implementation details
During pre-training, all tasks with the same format share their task-format-specific parameters. The control code paradigm introduces four new special tokens to the vocabulary. We tried initializing these additional parameters randomly, with the [CLS] token embedding, and with the [CLS] embedding plus noise; the choice has little impact on the resulting model performance, with random initialization being slightly better on average. We also tried loss weighting strategies (Chen et al., 2018; Liu et al., 2019a), but our preliminary experiments produced better results without any scaling, so we did not explore them further. All base models are trained for two epochs on two 48GB NVIDIA Quadro RTX 8000 GPUs with 16-bit precision, an effective batch size of 256, and a maximum input length of 512 tokens. Each batch is sampled with an equal number of examples from each task. We use AdamW (Loshchilov & Hutter, 2019) with ϵ = 1e-8. The learning rate follows an inverse square root schedule with a linear warmup of 700 steps and a peak of 5e-5.
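The learning rate schedule can be sketched as below; this is a minimal stand-in for the library scheduler actually used, and the convention of counting steps from 1 is an assumption:

```python
PEAK_LR = 5e-5
WARMUP_STEPS = 700

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR over WARMUP_STEPS, then inverse-sqrt decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (WARMUP_STEPS ** 0.5) / (step ** 0.5)
```

At the warmup boundary the two branches agree, and the rate then halves every time the step count quadruples (e.g. step 2800 = 4 x 700 gives half the peak rate).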
The adapter approaches follow the two-step training process and learning rate configurations described in Pfeiffer et al. (2021). One adapter per task format is attached to the base model in both the single-adapter and fusion stages, and is trained for a maximum of 6 and 4 epochs respectively. For PALs, one layer is added per task format and the entire network is trained for 2 epochs, as in Stickland & Murray (2019).

C. ASPIRE Evaluation
ASPIRE (Mysore et al., 2022) produces representations for the dense retrieval of scientific documents by matching multiple aspects between a query and its candidates. To evaluate these representations under the settings they were designed for, we report results only on the proximity tasks in Table 8. We use the model implementations available on HuggingFace, which were pre-trained on documents from the Computer Science (CS) and Biomedical (Bio) domains. The model variants can be further sub-categorized into retrieval based on best aspect matching (TS ASPIRE) and retrieval based on an Optimal Transport weighted sum of the similarity scores across all aspects (OT ASPIRE) between the query and candidates. Both our multi-format approaches, with control codes and with adapters, produce better results overall and on out-of-train tasks. Note, however, that since the ASPIRE models are trained on co-citations, they perform much better on average on the citation-based tasks from SciDocs.

D. SciRepEval Domain Distribution and MDCR Evaluation
We study the domain diversity of SciRepEval and display the results in Table 9. To compare against the training data for SciDocs, we consider the citation prediction triplets on which SPECTER is trained, which are also a subset of the SciRepEval in-train tasks. Even though Medicine and Computer Science papers still make up the bulk of the data, SciRepEval has 105x more documents on average per domain than the SPECTER triplets.
Further, as shown in Table 6, our task-format-based models outperform BM25 on the MDCR benchmark (Medić & Šnajder, 2022) and establish a new state of the art. Table 10 breaks down the results by field of study. Apart from Geology and History, the ensemble model is equivalent to or better than BM25 in all scientific domains.

E. SPECTER Objective
Lastly, we perform an ablation study to better understand the importance of the unsupervised citation-based training objective. We use SciBERT as the base model for this ablation since both SPECTER and SciNCL were trained with the citation objective. Removing the citation objective and its accompanying data from SciBERT + MTL CTRL, we find that in-train performance drops from 61.9 to 61.8, while out-of-train performance drops from 57.9 to 57.5, hinting that the citation objective may be helpful for generalization to new tasks.

F. Cross-Task Correlation Analysis
In Figure 2 we show Pearson correlations of model performance metrics between tasks in SciRepEval. To compute the correlations, we include all of the individual task results of the model runs shown in Table 2 and Table 3, excluding the ensembles. The correlations between tasks in SciDocs (bottom right) are the highest, while correlations between tasks in the entirety of SciRepEval span a larger range. Notably, DRSM-Complete and S2AND are uncorrelated with most other tasks. This shows that overall task diversity is larger in SciRepEval than in SciDocs.
Table 9: Data domain distribution in SciRepEval for the training tasks and comparison with SciDocs. We group the unique documents in both benchmarks by their MAG (Wang et al., 2020a) fields of study and present the counts in columns 2 and 3 and the absolute increase per field in column 4.

G. Task Relatedness For Multi-Task Training
To train on multiple tasks simultaneously, care must be taken to choose a combination of tasks that avoids negative transfer, but finding this optimal combination of tasks is hard (Aribandi et al., 2022; Padmakumar et al., 2022; Fifty et al., 2021). For a given set of T tasks, it is often not feasible to train on all 2^T − 1 task combinations to find the best one. Hence, rather than searching for an optimal combination, recent work suggests pre-fine-tuning on a large collection of tasks simultaneously to offset the negative transfer between subsets of those tasks (Aghajanyan et al., 2021; Aribandi et al., 2022). Padmakumar et al. (2022) show that pre-fine-tuning on a small set of tasks related to the downstream task is more efficient than large-scale multi-task training and yields similar results. As shown in Table 11, we study this training affinity among our task formats by pre-fine-tuning on individual task formats as well as on their combinations in a multi-task setup. Supporting the findings of Padmakumar et al. (2022), pre-fine-tuning on each individual format's training data leads to better or similar performance on related downstream tasks compared to training on all the formats at once, in all cases except out-of-train classification. Moreover, combining related task groups such as classification and regression during pre-fine-tuning results in better task transfer than training on them individually. This may be because the learned representations for both are consumed as features by a downstream linear SVM. Similarly, proximity and ad-hoc search appear to boost one another when trained together. However, training on all the tasks simultaneously yields the best results on average, and individual results are within 1 point of the best combination per task format, except for out-of-train regression and in-train classification.
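For concreteness, the 2^T − 1 non-empty task combinations can be enumerated directly; the format names below are a small illustration of why exhaustive search over task subsets quickly becomes infeasible as T grows:

```python
from itertools import combinations

formats = ["classification", "regression", "proximity", "search"]

# All non-empty subsets: 2^T - 1 of them for T task formats.
subsets = [c for r in range(1, len(formats) + 1)
           for c in combinations(formats, r)]
```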
(i) SciRepEval, a new comprehensive benchmark of 24 highly diverse and practical tasks for scientific document representation techniques across four different formats, of which 8 are made available for the first time, and six are explicitly designed for training.

Figure 1 :
Figure 1: Generating multi-format embeddings. A task format is either associated with a special control code appended to the input, or with adapter blocks attached to the model.

Table 3 :
Results for multi-format training with SciBERT and SPECTER as base models. For brevity, we report only the single-adapter results, given their computational efficiency advantage. The best results for each base model are underlined.

Table 4 :
Cross-task analysis for control codes. The best results for each task format across all control codes are underlined. These lie along the diagonal for both in-train and out-of-train tasks, suggesting that format-based partitioning in multi-task training produces effective document representations suitable for the corresponding format.

Table 5 :
Parameter and (relative) runtime efficiency of models. MTL CTRL and Adapters are similar in runtime, but PALs and the Fusion variants add significant computational cost.

Table 6 :
Comparison of SciNCL models trained on SciRepEval with BM25 on the MDCR benchmark. As in the original paper, we report MAP and Recall@5 scores. The best results obtained are highlighted in bold. Adapters add, and only train, half as many parameters as PALs. Fusion layers have 10x as many parameters as PALs, leading to 2x more inference time. Training and inference times are measured with 1k and 10k samples, respectively.

Table 7 :
Assigned input formats and control codes for each task format. [CLF], [RGN], [PRX] and [QRY] are special tokens; doc is the input.

Table 8 :
Comparison of our SciNCL multi-format methods with ASPIRE on the proximity tasks. The best results for each base model are underlined. TS: Text Supervision, OT: Optimal Transport.

For classification and regression, we train a linear SVM on each downstream task using the embeddings as input, and we tune the regularization parameter C via grid search. Multi-class and multi-label classification are configured in the one-vs-all classifier setting.
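A minimal sketch of this evaluation protocol with scikit-learn; the toy embeddings, the C grid, and the cross-validation fold count are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Toy stand-in for frozen document embeddings: two separable classes.
X = np.vstack([rng.normal(-2.0, 1.0, (50, 8)), rng.normal(2.0, 1.0, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# Tune the SVM regularization parameter C via grid search.
search = GridSearchCV(LinearSVC(max_iter=5000), {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
search.fit(X, y)
```

For multi-label tasks, the same classifier can be wrapped in a one-vs-rest scheme (e.g. `sklearn.multiclass.OneVsRestClassifier`).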