Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding

Scientific literature understanding tasks have gained significant attention due to their potential to accelerate scientific discovery. Pre-trained language models (LMs) have shown effectiveness in these tasks, especially when tuned via contrastive learning. However, jointly utilizing pre-training data across multiple heterogeneous tasks (e.g., extreme multi-label paper classification, citation prediction, and literature search) remains largely unexplored. To bridge this gap, we propose a multi-task contrastive learning framework, SciMult, with a focus on facilitating common knowledge sharing across different scientific literature understanding tasks while preventing task-specific skills from interfering with each other. To be specific, we explore two techniques -- task-aware specialization and instruction tuning. The former adopts a Mixture-of-Experts Transformer architecture with task-aware sub-layers; the latter prepends task-specific instructions to the input text so as to produce task-aware outputs. Extensive experiments on a comprehensive collection of benchmark datasets verify the effectiveness of our task-aware specialization strategy, where we outperform state-of-the-art scientific pre-trained LMs. Code, datasets, and pre-trained models can be found at https://scimult.github.io/.


Introduction
Scientific literature understanding tasks, such as paper classification (Zhang et al., 2023b), citation prediction (Bhagavatula et al., 2018), scientific literature search (Voorhees et al., 2021), and recommendation (Kanakia et al., 2019), have received increasing attention because they can be broadly applied to academic service platforms (Tang et al., 2008; Sinha et al., 2015; Ammar et al., 2018) and, more importantly, uncover knowledge structures to accelerate scientific discovery (Naik et al., 2022; Chandak et al., 2023). Recent studies have demonstrated the effectiveness of pre-trained language models (LMs) (Beltagy et al., 2019; Gu et al., 2021; Liu et al., 2022) in these tasks, as they generate high-quality scientific text representations, especially when the LMs are further tuned via contrastive learning. For example, MICoL (Zhang et al., 2022) proposes metadata-induced contrastive learning that can perform extreme multi-label paper classification with more than 10,000 classes; SPECTER (Cohan et al., 2020) and SciNCL (Ostendorff et al., 2022) leverage citation information to create training pairs and achieve remarkable performance in predicting various types of links between papers.
Nevertheless, jointly using data across different scientific literature understanding tasks for LM pre-training remains largely unexplored. Intuitively, some common knowledge and skills can be shared across related tasks. For example, accurately identifying fine-grained topic classes of a paper not only helps classification but also provides hints for link prediction and search. A multi-task learning framework is therefore expected to be beneficial, with improved parameter efficiency as an additional advantage. However, if all parameters of the backbone LM are shared across tasks (Liu et al., 2019), the model is observed to suffer from undesirable task interference (Ma et al., 2023): when jointly trained beyond a certain extent, it sacrifices performance on some tasks to boost the others. This is because different tasks still require specialized skills that compete for the limited shared parameter space. For instance, the encoder for extreme multi-label paper classification should focus more on fine-grained fields-of-study entities in each paper, while the encoder for citation prediction should put more effort into understanding citation intents. Mixing these two skills may result in negative transfer across the two tasks.
In this paper, to mitigate task interference in multi-task scientific literature understanding, we propose to consider two techniques: task-aware specialization and instruction tuning. Task-aware specialization, inspired by the Mixture-of-Experts (MoE) Transformer architecture (Fedus et al., 2022; Du et al., 2022; Zhou et al., 2022; Cheng et al., 2023), modifies the Transformer block in the LM to have multiple parallel sub-layers, each of which is dedicated to one task. When performing different tasks, the input is routed to different sub-layers based on the task type. In this way, the encoder contains both shared parameters and task-specific parameters, making it capable of producing task-aware outputs while tapping into shared knowledge. In contrast, instruction tuning adopts one encoder for all tasks, but prepends task-specific instructions (Wei et al., 2022; Sanh et al., 2022; Ouyang et al., 2022; Wang et al., 2022; Chung et al., 2022; Asai et al., 2023) to the input text during training so that the encoder learns to produce task-aware representations. These two ideas are different from the techniques (e.g., adapters (Houlsby et al., 2019) and control codes (Keskar et al., 2019)) explored in previous studies (Singh et al., 2022) for multi-task scientific text representation learning. Indeed, as far as we know, this is a pioneering study that explores the effect of the MoE Transformer architecture and instruction tuning in scientific NLP.
To validate the efficacy of our proposed techniques, we conduct a comprehensive empirical study using datasets from multiple sources (Kanakia et al., 2019; Cohan et al., 2020; Thakur et al., 2021; Zhao et al., 2022; Singh et al., 2022; Zhang et al., 2023b) for evaluating various scientific literature understanding tasks. For each task, models will be tested on not only in-domain but also cross-domain evaluation datasets. Specifically, for extreme multi-label text classification, models trained on computer science and biomedicine papers will be tested in the geography and psychology fields (Zhang et al., 2023b); for link prediction, models trained on citation signals (Cohan et al., 2020) need to be evaluated on patient summary retrieval (Zhao et al., 2022) and paper recommendation (Kanakia et al., 2019); for search, models will be tested on datasets specific to COVID-19 (Voorhees et al., 2021) or related to claim verification (Wadden et al., 2020) which are not seen during pre-training. Experimental results show that SciMult-MHAExpert outperforms competitive scientific pre-trained LMs (Cohan et al., 2020; Ostendorff et al., 2022; Singh et al., 2022) on most datasets and achieves the new state-of-the-art performance on the leaderboard of PMC-Patients (Zhao et al., 2022). Ablation studies further prove that task-aware specialization can effectively mitigate task interference, while the improvement brought by instruction tuning is not consistent across all tasks.

Background
We consider three widely studied tasks in scientific literature understanding: classification, link prediction, and search.

(Extreme Multi-Label) Classification. Classifying academic papers into their relevant label(s) is a fundamental task in scientific text mining. It can help organize papers according to their fields/themes and benefit downstream applications such as trend analysis of scientific topics (Prabhakaran et al., 2016; Jin et al., 2021). The label space L can be either coarse-grained (Cohan et al., 2020; e.g., predicting whether a paper belongs to "Computer Science" or "Biology") or fine-grained (Peng et al., 2016; Xun et al., 2019; Ye et al., 2021; Zhang et al., 2021, 2023b; e.g., predicting whether a paper is relevant to "Alphacoronavirus" or "Betacoronavirus" or both). When L is large and fine-grained (e.g., with more than 10,000 labels), it is natural to assume that each paper p can be relevant to more than one label. This task is called extreme multi-label classification (Liu et al., 2017; Prabhu et al., 2018; Chang et al., 2020), which aims to rank all labels l ∈ L according to how likely p is relevant to l.

Link Prediction. Link prediction aims to predict whether a certain type of link exists between a query paper p_Q and a candidate paper p_C. Narrowly, "links" refer to citation links (Bhagavatula et al., 2018; Wright and Augenstein, 2021; i.e., p_Q cites p_C).
Broadly, "links" can be defined as relations such as p_Q and p_C being frequently co-viewed by users, frequently co-cited by other papers, and so on (Cohan et al., 2020; Zhao et al., 2022). An accurate link prediction model can benefit tasks like paper recommendation (Kanakia et al., 2019) and help identify the potential use of scientific literature (Yin et al., 2022; Lin et al., 2023).

Search. Scientific literature search helps researchers keep track of their fields of interest and prevents them from drowning in the whole literature. Given a search query q and a pool of papers P, the task is to find papers p ∈ P that are relevant to q. Search also serves as the initial step of more complex scientific text mining tasks, such as claim verification (Wadden et al., 2020) and open-domain question answering (Jin et al., 2019).

Multi-task Contrastive Learning
One can observe that the three tasks share a common structure: a "query" q and a pool of "candidates" C = {c_1, c_2, ..., c_|C|} are given. To be specific, q is the paper p to be classified in classification, the query paper p_Q in link prediction, and the search query q in literature search; C is the label space L in classification, the set of candidate papers p_C in link prediction, and the candidate paper pool P in search. This motivates us to jointly train a multi-task model that is applicable to all tasks.
To implement this idea, our proposed SciMult framework is built upon a Bi-Encoder architecture, where two encoders (whose parameters are shared) encode queries and candidates independently. Following Karpukhin et al. (2020), we adopt maximum inner product search (MIPS) in the embedding space to find positive candidates for each query. Formally, the similarity is calculated as

sim(q, c) = E(q)^⊤ E(c),  (1)

where E(·) denotes the encoder. The Bi-Encoder is trained via a contrastive learning objective that pulls a relevant pair (q, c+) together while pushing irrelevant pairs (q, c−) apart:

L = − log ( exp(sim(q, c+)) / ( exp(sim(q, c+)) + Σ_{c−} exp(sim(q, c−)) ) ).  (2)

Although it has been shown that model parameter sharing improves the performance of Bi-Encoder retrievers (Xiong et al., 2021), simply sharing all parameters of the backbone LM (Liu et al., 2019) may suffer from task interference and lead to suboptimal performance. This is because the semantics implied by relevant (q, c) pairs vary across tasks. For example, the encoder for fine-grained paper classification needs to pay more attention to fields-of-study entities, while the encoder for link prediction focuses on understanding citation intents. To tackle this issue, we consider two different strategies, task-aware specialization and instruction tuning, to learn task-specific representations of scientific text.

Footnote 1: This assumption holds for benchmark scientific label spaces such as fields-of-study in the Microsoft Academic Graph (MAG) (Shen et al., 2018) and terms in Medical Subject Headings (MeSH) (Coletti and Bleich, 2001). Meanwhile, if label definitions are not available, our model can take label names as the only input, the performance of which is studied in Appendix E.4.
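As a concrete reference, the contrastive objective of Equation 2 can be sketched in a few lines for a single query. The function name and the use of NumPy are illustrative, not from the SciMult codebase:

```python
import numpy as np

def info_nce_loss(q_emb, pos_emb, neg_embs):
    """Contrastive loss of Equation 2 for one query.

    q_emb:    (d,) query embedding E(q)
    pos_emb:  (d,) embedding of the relevant candidate c+
    neg_embs: (m, d) embeddings of irrelevant candidates c-
    Similarity is the inner product, as in Equation 1.
    """
    pos_score = q_emb @ pos_emb        # sim(q, c+)
    neg_scores = neg_embs @ q_emb      # sim(q, c-) for each c-
    scores = np.concatenate([[pos_score], neg_scores])
    scores -= scores.max()             # numerical stability
    # Loss is -log softmax probability of the positive candidate.
    return -(scores[0] - np.log(np.exp(scores).sum()))
```

Driving the loss down pushes sim(q, c+) above every sim(q, c−), which is exactly what makes MIPS retrieval with the learned embeddings work at inference time.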

Task-Aware Specialization
Inspired by recent Mixture-of-Experts (MoE) models (Fedus et al., 2022; Du et al., 2022; Zhou et al., 2022; Cheng et al., 2023; Ma et al., 2023), we propose to adopt task-specific Transformer blocks in the LM architecture. Specifically, a typical Transformer block (Vaswani et al., 2017) contains a multi-head attention (MHA) sub-layer and a feed-forward network (FFN) sub-layer. Fedus et al. (2022) propose a Mixture-of-Experts Transformer block with multiple FFN sub-layers stacked upon a shared MHA sub-layer. As shown in Figure 1 (right), we adopt this architecture and let each FFN sub-layer correspond to one particular task t (t ∈ {classification, link prediction, search}). For example, if the encoder E(·) is trained/tested on a classification task, the input will be routed to the classification FFN. Different from Fedus et al. (2022), Ma et al. (2023) propose to specialize the MHA sub-layer instead and observe better performance in open-domain question answering. We test this architecture as well, where, as shown in Figure 1 (left), a shared FFN sub-layer is stacked upon task-specific MHA sub-layers. We again use the same task-dependent routing for this variant.
In SciMult, following Du et al. (2022), we stack typical Transformer blocks and task-specific Transformer blocks alternately, i.e., for a base-size LM with 12 Transformer blocks, there will be 6 typical Transformer blocks and 6 task-specific Transformer blocks interleaved with each other. In this way, the encoder has both parameters θ_t characterizing specific skills for task t and parameters θ characterizing the common knowledge shared across all tasks. Thus, we denote the encoder as E_{θ,θ_t}(·), and the task-aware similarity between the query q and the candidate c is modified as

sim_t(q, c) = E_{θ,θ_t}(q)^⊤ E_{θ,θ_t}(c).  (3)

Note that the query encoder and the candidate encoder for the same task still share their parameters.
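The task-dependent routing described above can be sketched as follows. The class and the random linear maps standing in for real MHA/FFN sub-layers are purely illustrative; only the routing logic mirrors the MHA-expert design:

```python
import numpy as np

TASKS = ("classification", "link_prediction", "search")

class TaskAwareBlock:
    """Sketch of the MHA-expert Transformer block (Figure 1, left):
    one "MHA" sub-layer per task (parameters theta_t), stacked under a
    shared "FFN" sub-layer (parameters theta). Real attention and
    feed-forward layers are replaced by linear maps for illustration."""

    def __init__(self, dim, rng):
        self.mha_experts = {t: rng.standard_normal((dim, dim)) / dim
                            for t in TASKS}          # task-specific theta_t
        self.shared_ffn = rng.standard_normal((dim, dim)) / dim  # shared theta

    def forward(self, x, task):
        # Route the input to the sub-layer of the current task,
        # then pass it through the parameters shared across all tasks.
        h = x @ self.mha_experts[task]
        return h @ self.shared_ffn
```

Because routing is determined solely by the task label, no learned gating network is needed, unlike token-level MoE models.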

Instruction Tuning
Training LMs with task-specific instructions (Wei et al., 2022; Sanh et al., 2022; Ouyang et al., 2022; Wang et al., 2022; Chung et al., 2022; Asai et al., 2023) has been extensively studied, with remarkable progress achieved. However, the effect of instruction tuning on scientific literature understanding tasks has remained elusive. Moreover, the major focus of previous studies is on effective zero-shot or few-shot model transfer to new tasks rather than on mitigating task interference. As a result, a head-to-head comparison between instruction tuning and MoE in multi-task LM pre-training is missing, and we aim to bridge this gap in this paper. Different from task-aware specialization, which trains E_{θ,θ_t}(·) for each task t, instruction tuning keeps one encoder E_θ(·) with all its parameters θ shared across different tasks. Each task t is instead characterized by a natural language instruction x_t. The instructions we use for the three tasks are shown in Table 1. We prepend the representations of x_t to the query and candidate texts to get their task-aware embeddings. To be specific, suppose x_t contains K tokens {x_{t,k}}_{k=1}^K. We first use an instruction encoder E_ϕ(·) to encode x_t and obtain its token representations {x_{t,k}^{(n)}}_{k=1}^K after each layer n ∈ {0, 1, ..., N}. Then, we use the query/candidate encoder E_θ(·) to encode q = q_1 q_2 ... q_A and c = c_1 c_2 ... c_B. At layer n ∈ {1, 2, ..., N}, the output representations of q and c take the instruction token representations corresponding to that layer as context:

q^(n) = Layer_n([x_{t,1}^(n−1); ...; x_{t,K}^(n−1); q^(n−1)]),  c^(n) = Layer_n([x_{t,1}^(n−1); ...; x_{t,K}^(n−1); c^(n−1)]).  (4)

The task-aware similarity between q and c is then

sim_t(q, c) = E_θ(q; x_t)^⊤ E_θ(c; x_t).  (5)

There are two ways to train the model. First, we can update the entire architecture, including both the instruction encoder E_ϕ(·) and the

Task            | Instruction
Classification  | Tag a scientific paper with relevant scientific topic classes.
Link Prediction | Find a pair of scientific papers that one paper cites the other.
Search          | Retrieve a scientific paper that is relevant to the query.

Table 1: Instructions used for the three tasks.
query/candidate encoder E_θ(·), and let them share parameters (i.e., ϕ = θ). Second, we can keep the query/candidate encoder frozen and optimize only the instruction encoder, which bears similarities to prefix-tuning (Li and Liang, 2021) by treating instructions as prefixes. We will evaluate both approaches in our experiments.
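A minimal sketch of how the Table 1 instructions could condition a shared encoder is given below. SciMult injects instruction token representations as per-layer context (Equation 4); plain-text prepending, shown here, is a simplification, and the function name is hypothetical:

```python
INSTRUCTIONS = {
    "classification": "Tag a scientific paper with relevant scientific topic classes.",
    "link_prediction": "Find a pair of scientific papers that one paper cites the other.",
    "search": "Retrieve a scientific paper that is relevant to the query.",
}

def with_instruction(task, text):
    """Prepend the task instruction x_t to a query/candidate text so that a
    single shared encoder E_theta can produce task-aware embeddings.
    "[SEP]" marks the instruction/text boundary, mimicking BERT-style
    segment separation."""
    return INSTRUCTIONS[task] + " [SEP] " + text
```

Under this simplification, both training approaches reduce to whether gradients flow into the tokens of the instruction prefix, the encoder, or both.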

Negative Sampling
Previous studies on contrastive learning with scientific text (Cohan et al., 2020; Ostendorff et al., 2022) have emphasized the importance of hard negatives. Specific to citation prediction, Cohan et al. (2020) propose a way to derive hard negatives: given a positive query-candidate pair (p_Q, p_C), a paper cited by p_C but not by p_Q is semantically close to the pair yet not a gold candidate, and is thus treated as a hard negative. We generalize this idea to the other tasks. For extreme multi-label classification, given a positive paper-label pair (p, l), we consider a paper p′ cited by p, sharing a common author with p, or published in the same venue as p; in any of these cases, p′ should be semantically close to p. If p′ has a label l′ that is irrelevant to p, then l′ is treated as a hard negative for (p, l). For literature search, we use the training data from Singh et al. (2022), which contains a short list of papers returned by an academic search engine for each query q. The papers clicked by users are treated as positives p+, and the others in the list are viewed as hard negatives p−.
Related studies (Cohan et al., 2020; Ostendorff et al., 2022) show that combining easy negatives (i.e., negatives randomly sampled from the entire candidate pool C) and hard negatives leads to better performance. Thus, during pre-training, for each positive pair (q, c+), we sample one hard negative and treat all in-batch negatives (Karpukhin et al., 2020) as easy negatives, both of which are combined as c−'s to optimize Equation 2.
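The resulting scoring setup, one hard negative per pair plus in-batch easy negatives, can be sketched as a score matrix (the function name is hypothetical):

```python
import numpy as np

def batch_scores(queries, positives, hard_negs):
    """For a batch of n (q, c+) pairs plus one hard negative per pair,
    build the n x 2n score matrix used for in-batch contrastive training:
    for j < n, entry (i, j) is sim(q_i, c+_j) -- the diagonal holds the
    positives, and the off-diagonal entries act as easy in-batch
    negatives; for j >= n, entry (i, j) is sim(q_i, c-_{j-n}), i.e., the
    hard negatives shared across the batch."""
    cands = np.concatenate([positives, hard_negs], axis=0)  # (2n, d)
    return queries @ cands.T                                # (n, 2n)
```

Each row of the matrix then feeds a softmax cross-entropy whose target is the row's diagonal positive, so every query is contrasted against 2n − 1 negatives at the cost of a single matrix product.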

Datasets
We adopt a comprehensive collection of benchmark datasets for model evaluation. Each task has its pre-training, in-domain evaluation, and cross-domain evaluation datasets, which are briefly introduced below. Table 2 summarizes our usage of these datasets, with more details in Appendix A.
Classification. We consider the MAPLE benchmark (Zhang et al., 2023b), which consists of 23 fine-grained multi-label paper classification datasets across 19 scientific fields. Each paper in MAPLE is tagged with its relevant MAG fields-of-study (Shen et al., 2018) and MeSH terms (Coletti and Bleich, 2001). Among the 23 datasets, CS-Journal, Biology-MeSH, and Medicine-MeSH are selected for pre-training; CS-Conference and Chemistry-MeSH are used for in-domain evaluation; Geography and Psychology, whose candidate label spaces are not seen during pre-training, are utilized for cross-domain evaluation. Besides, we use the MAG and MeSH datasets in the SciDocs benchmark (Cohan et al., 2020) as in-domain evaluation datasets for coarse-grained paper classification.
Link Prediction. For pre-training, we leverage more than 819K citation prediction triplets released in Cohan et al. (2020), which were used to pre-train SPECTER (Cohan et al., 2020). For in-domain evaluation, we make use of the SciDocs benchmark (Cohan et al., 2020), which evaluates the prediction of four link types: Co-view, Co-read, Cite, and Co-cite. For cross-domain evaluation, we use (1) the PMC-Patients dataset (Zhao et al., 2022), where each query is a patient summary and the task is to find its linked research articles and patient summaries, and (2) the Recommendation dataset (Kanakia et al., 2019) collected via an online survey, where the participants are authors of query papers, and they need to judge the relevance between the query paper and some candidate papers on a scale of 1 to 5.
Search. For pre-training, we exploit the Search dataset released in the SciRepEval benchmark (Singh et al., 2022) with 528,497 queries. It also has a hold-out testing set with 2,637 queries, which we employ for in-domain evaluation. For cross-domain evaluation, we adopt TREC-COVID (Voorhees et al., 2021), SciFact (Wadden et al., 2020), and NFCorpus (Boteva et al., 2016), all of which are from the popular BEIR benchmark (Thakur et al., 2021). Note that TREC-COVID has two different versions in SciRepEval and BEIR.
The BEIR version has simpler queries and a larger pool of candidate papers. We report model performance on both versions.
Besides LM baselines, we also report the performance of some task-specific classical methods, such as Citeomatic (Bhagavatula et al., 2018) for citation prediction, Kanakia et al. (2019) for recommendation, and BM25 (Robertson and Walker, 1994) for search.

Fine-grained Classification
Following Zhang et al. (2023b), we consider a simple heuristic when ranking all candidate labels: labels whose name appears in the query paper p should be ranked higher than those not appearing in p. In other words, we first rank all labels l according to sim(p, l) and then reorder the labels appearing in p to precede all other labels. Here, we evaluate model performance using Recall@k (k = 20, 50, 100), i.e., the proportion of a paper's gold labels found among the top-k retrieved results.
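The heuristic and the metric can be sketched as follows (function names are hypothetical):

```python
def rerank_with_heuristic(paper_text, ranked_labels):
    """Reorder labels so that any label whose (lower-cased) name appears in
    the paper text precedes all others, preserving the similarity-based
    order within each group (the heuristic of Zhang et al., 2023b)."""
    text = paper_text.lower()
    in_text = [l for l in ranked_labels if l.lower() in text]
    rest = [l for l in ranked_labels if l.lower() not in text]
    return in_text + rest

def recall_at_k(ranked_labels, gold_labels, k):
    """Proportion of a paper's gold labels found in the top-k results."""
    return len(set(ranked_labels[:k]) & set(gold_labels)) / len(gold_labels)
```

Note that the heuristic is a stable partition: it never changes the relative order produced by sim(p, l) within the "appears in p" and "does not appear in p" groups.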
The results are shown in Table 3. We also conduct experiments without using this heuristic, the results of which are shown in Table A3, where all models perform consistently worse. From Table 3, we observe that: (1) SciMult-MHAExpert and SciMult-Instruction outperform all baselines on all four datasets. The other three SciMult variants have significant advantages over all baselines on in-domain evaluation datasets, but their edges on cross-domain datasets are not consistent. (2) Among all the SciMult variants, SciMult-MHAExpert is always the best on cross-domain datasets, indicating its generalizability to unseen label spaces.

Coarse-grained Classification
Following Cohan et al. (2020) and Ostendorff et al. (2022), we directly predict the most likely coarse label of each paper and use Macro-F1 as the evaluation metric. Two different settings are considered here: (1) The Bi-Encoder setting (abbreviated to "BiEnc" in Table 4) follows our practice in fine-grained classification, where we calculate sim_t(p, l) given the description of each l (without relying on any training data after LM pre-training). (2) The Linear Classifier setting (abbreviated to "Linear" in Table 4) follows the practice in Cohan et al. (2020), which takes the embedding vector E(p) as the input feature of each paper p and trains a linear SVM for classification. This setting requires labeled training and validation data after LM pre-training but no longer needs label descriptions.
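The "Linear" setting can be sketched as follows. The paper trains a linear SVM on the frozen embeddings; this dependency-free sketch substitutes ridge regression on one-hot targets as a stand-in linear classifier, and the function names are hypothetical:

```python
import numpy as np

def fit_linear_classifier(emb, labels, n_classes, l2=1e-2):
    """Fit one linear scorer per class on frozen paper embeddings E(p).
    Stand-in for the paper's linear SVM: ridge regression on one-hot
    targets, solved in closed form.
    emb: (n, d) paper embeddings; labels: (n,) integer class ids."""
    y = np.eye(n_classes)[labels]                        # one-hot targets
    d = emb.shape[1]
    w = np.linalg.solve(emb.T @ emb + l2 * np.eye(d), emb.T @ y)
    return w                                             # (d, n_classes)

def predict(emb, w):
    """Predict the most likely coarse label per paper."""
    return (emb @ w).argmax(axis=1)
```

The key property shared with the paper's setup is that the LM stays frozen: only the lightweight linear head sees the labeled training data.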
The results of both settings are shown in Table 4. We find that, on average, SciMult variants can always beat all baselines, with a more pronounced out-of-the-box advantage (i.e., in the "BiEnc" setting). On the other hand, when further fine-tuned with a linear classifier, most pre-trained models perform very similarly, with better performance achieved by the multi-task models (SPECTER 2.0 and our SciMult variants), indicating the advantage of multi-task training for coarse-grained classification.

Link Prediction
In the link prediction task, we also consider two settings: retrieval and reranking.
For the retrieval setting, given a query paper p_Q, the task is to find all candidate papers p_C that are linked to p_Q (via the relation "Cite", "Co-cite", etc.) from the whole dataset. The link prediction performance of compared methods under the retrieval setting is shown in Table 5, where the evaluation metrics are Recall@20, 50, and 100. As expected, SciMult variants significantly outperform all baselines in almost all cases.
As for the cross-domain PMC-Patients dataset (Zhao et al., 2022), besides evaluating the zero-shot retrieval performance, we also train a supervised SciMult by further fine-tuning it on the provided training data. Specifically, we pick SciMult-MHAExpert (because of its overall good performance according to Table 9) and use DPR (Karpukhin et al., 2020) to fine-tune it. The comparison between DPR(SciMult-MHAExpert) and existing models on the leaderboard of PMC-Patients is shown in Table 6. Our model outperforms all existing models and achieves the new state-of-the-art.
For the reranking setting, we follow the original evaluation protocol of SciDocs (Cohan et al., 2020): for each query paper p_Q, a small set of candidate papers {p_C1, p_C2, ...} is given, which contains up to 5 positives and 25 random negatives. The task aims to rank the positives higher than the negatives, so we use MAP and nDCG as evaluation metrics. The Recommendation dataset (Kanakia et al., 2019) has a similar format, except that the score of each candidate is not binary and ranges from 1 to 5. As a result, the model needs to rank candidate papers with higher scores in front of those with lower scores, and we use nDCG and nDCG@k (k = 5, 10) as the metrics. (MAP cannot be used in Recommendation because relevance is not binary.) The performance is demonstrated in Table 7.
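The graded-relevance metric can be sketched as follows (a linear gain is assumed here; exponential-gain variants of nDCG also exist, and the paper does not specify which is used):

```python
import numpy as np

def ndcg_at_k(ranked_scores, k=None):
    """nDCG for graded relevance (scores 1-5 in the Recommendation
    dataset): DCG of the model's ranking divided by DCG of the ideal
    ranking. ranked_scores lists the gold relevance of candidates in the
    order the model ranked them; k=None evaluates the full list."""
    def dcg(scores):
        scores = np.asarray(scores, dtype=float)[:k]
        # Gain discounted by log2 of the (1-indexed) rank + 1.
        return float((scores / np.log2(np.arange(2, len(scores) + 2))).sum())
    ideal = sorted(ranked_scores, reverse=True)
    return dcg(ranked_scores) / dcg(ideal)
```

Unlike MAP, this definition rewards placing a 5-rated candidate above a 3-rated one, which is why nDCG is usable on Recommendation while MAP is not.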
From Table 7, we observe that: (1) On SciDocs, SciMult variants outperform SPECTER in most cases. Note that for the link prediction task, SciMult uses exactly the same pre-training data (including hard negatives) as SPECTER (Cohan et al., 2020). Therefore, this observation implies that exploiting pre-training data from other tasks can benefit the link prediction performance, which validates our motivation to develop a multi-task learning framework. (2) On SciDocs, SciMult variants can rarely outperform SciNCL and SPECTER 2.0 (but the gaps are not evident). This is possibly because SciNCL uses more complicated hard negative sampling strategies, and SPECTER 2.0 exploits more diverse pre-training data and tasks. In fact, Ostendorff et al. (2022) have pointed out the data leakage issue that 40.5% of SciDocs papers appear in the pre-training data. This motivates us to incorporate the cross-domain Recommendation dataset, on which SciMult-Vanilla and SciMult-MHAExpert consistently outperform all baselines.

Search
We use nDCG@10, the primary metric of BEIR (Thakur et al., 2021), to measure model performance on the search task. The results are shown in Table 8. We find that, on the four cross-domain evaluation datasets and on average, four SciMult variants can outperform all baselines. In particular, on NFCorpus, only SciMult can beat BM25, while all baseline LMs fall short by a clear margin.

Table 8: Search performance on SciRepEval-Search (Singh et al., 2022), TREC-COVID (Voorhees et al., 2021), SciFact (Wadden et al., 2020), and NFCorpus (Boteva et al., 2016). The score with † is reported in Singh et al. (2022).

Overall Analysis
To summarize, in Tables 3, 4, 5, and 8, all SciMult variants except SciMult-FFNExpert always beat all baselines in terms of the average metric. Meanwhile, to systematically examine whether our proposed techniques can mitigate task interference, we calculate the relative performance change of the four non-Vanilla variants in comparison with SciMult-Vanilla on each task. Table 9 shows the results. We observe that SciMult-MHAExpert improves over SciMult-Vanilla across all tasks, which implies that the MoE architecture with task-specific MHA sub-layers effectively overcomes task interference during multi-task pre-training. By contrast, the other proposed techniques are advantageous only in a subset of tasks, such as coarse-grained classification and link prediction under the retrieval setting.
To validate the design choices of SciMult, we conduct more analysis through controlled experiments, which can be found in Appendix E. To briefly summarize, we observe that: (1) using hard negatives in multi-task contrastive learning helps produce higher-quality text representations in general and benefits all tasks; (2) ...

Related Work

Pre-trained LMs in the Scientific Domain. Scientific pre-trained LMs, from Transformer-based ones such as SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2020), ChemBERT (Guo et al., 2021), and PubMedBERT (Gu et al., 2021) to GPT-based ones such as SciGPT2 (Luu et al., 2021) and BioGPT (Luo et al., 2022), aim to learn high-quality contextualized text representations for scientific literature understanding. Subsequent studies have applied these LMs to fine-grained paper classification (Zhang et al., 2022, 2023a), cite-worthiness detection (Wright and Augenstein, 2021), scientific claim verification (Wadden et al., 2020), and so on.
Besides text information, metadata associated with scientific papers is also broadly considered. For example, Citeomatic (Bhagavatula et al., 2018), SPECTER (Cohan et al., 2020), BioLinkBERT (Yasunaga et al., 2022), and SciNCL (Ostendorff et al., 2022) leverage citation links between papers; OAG-BERT (Liu et al., 2022) models venues, authors, fields-of-study, and affiliations during LM pre-training; S2AND (Subramanian et al., 2021) further utilizes year, email, and position information for author name disambiguation. Nevertheless, all aforementioned models either consider only typical LM pre-training tasks (e.g., MLM and NSP) or focus on one additional task during model training (e.g., citation prediction). In comparison, SciMult exploits data from heterogeneous sources and proposes a multi-task learning framework that can be applied to a wide range of tasks.
Contrastive Learning and Multi-task Learning in the Scientific Domain. SPECTER (Cohan et al., 2020) and SciNCL (Ostendorff et al., 2022) are pioneering studies on using contrastive learning to enhance scientific LMs. They propose to derive positive and negative contrastive pairs from citation triplets and demonstrate the power of mining hard negatives. MICoL (Zhang et al., 2022) and CitationSum (Luo et al., 2023) apply contrastive learning to multi-label classification and summarization of scientific papers, respectively. As for multi-task learning, Luan et al. (2018) propose a multi-task scientific knowledge graph construction framework by jointly identifying entities, relations, and coreference; Wang et al. (2019) treat multiple biomedical named entity recognition datasets (with different types of entities annotated) as multiple tasks so that they can mutually benefit each other. However, these studies do not have specific designs to tackle task interference. To the best of our knowledge, the recent work by Singh et al. (2022) is the most relevant one to SciMult; it pre-trains a scientific LM on various tasks and uses adapters (Houlsby et al., 2019) and control codes (Keskar et al., 2019) to produce task-aware paper representations. We believe our study is orthogonal to Singh et al. (2022), as the techniques considered by us, including the Mixture-of-Experts Transformer (Fedus et al., 2022) and instruction tuning (Wei et al., 2022), are distinct from theirs.

Conclusions
In this work, we propose to pre-train scientific LMs via multi-task contrastive learning. To mitigate task interference, we adopt two strategies: task-aware specialization considers a Mixture-of-Experts Transformer architecture so that each task has its unique components, while instruction tuning relies on task-specific instructions to produce task-aware text representations. Extensive experiments on a comprehensive collection of benchmark datasets demonstrate the advantages of our models against competitive scientific LMs in extreme multi-label classification, link prediction, and search. In particular, we achieve the new state-of-the-art performance on the PMC-Patients leaderboard. We also show that task-specific MHA sub-layers are beneficial to the model performance across all examined tasks, whereas the benefit of the other proposed techniques is not consistent. Further analysis validates some of our design choices, such as hard negative mining in extreme classification.

A.3 Search
The SciRepEval benchmark (Singh et al., 2022) is available at https://github.com/allenai/scirepeval. The BEIR benchmark (Thakur et al., 2021) is available at https://github.com/beir-cellar/beir.
SciRepEval-Search. The original data scores each query-paper pair (q, p) in the range of 0 to 14 according to user click-through events from a scholarly search engine. The training and validation sets of SciRepEval-Search are included in our pre-training data, where we treat all (q, p) pairs with a positive score as positive (q, p) pairs. The testing set of SciRepEval-Search is utilized for in-domain evaluation.

B Baseline Details
• SciBERT (Beltagy et al., 2019) is an LM pretrained on scientific text using masked language modeling (MLM) and next sentence prediction (NSP).
• SentBERT (Reimers and Gurevych, 2019) is a general-domain LM that leverages negative sampling to fine-tune BERT for producing better sentence embeddings.
• SPECTER (Cohan et al., 2020) uses paper citations to generate positive and negative samples for contrastive fine-tuning of SciBERT.
• PubMedBERT (Gu et al., 2021) is a biomedical LM pre-trained on PubMed papers using MLM and NSP. We use the checkpoint pre-trained on abstracts rather than the one pre-trained on full texts because the former performs better on the majority of our evaluation tasks.
• LinkBERT and BioLinkBERT (Yasunaga et al., 2022) leverage a Cross-Encoder architecture that concatenates two linked text segments together and are trained through MLM and NSP on Wikipedia and PubMed, respectively.
• OAG-BERT (Liu et al., 2022) is an entity-augmented scientific LM pre-trained on both academic texts and their associated metadata entities (e.g., venues, authors) through masked entity prediction.
• SciNCL (Ostendorff et al., 2022) advances the sampling strategy of SPECTER to create higher-quality positives and negatives for neighborhood contrastive learning.
• SPECTER 2.0 (Singh et al., 2022) is the successor to SPECTER, pre-trained on a much larger collection of citation prediction triplets and more diverse tasks from the SciRepEval benchmark (Singh et al., 2022). We adopt SPECTER 2.0-Adapters to generate task-specific embeddings for different tasks.
For all baselines, we set the similarity function as sim(q, c) = cos(q, c), where q and c are query and candidate embeddings, respectively, after LM encoding. (Except for reranking tasks on SciDocs, where the evaluation code explicitly sets sim(q, c) = -||q - c||_2.) When using SentBERT and OAG-BERT, we take the average of all token embeddings to represent the entire input sequence because this leads to significantly better performance; when using the other baselines above, we take the [CLS] embedding.
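For concreteness, the two similarity functions can be written as follows (a plain-Python sketch over embedding vectors represented as lists):

```python
import math

def cos_sim(q, c):
    # cosine similarity, used for all tasks except SciDocs reranking
    dot = sum(a * b for a, b in zip(q, c))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_c = math.sqrt(sum(b * b for b in c))
    return dot / (norm_q * norm_c)

def neg_l2_sim(q, c):
    # negative Euclidean distance, as set by the SciDocs evaluation code
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))
```

Both functions rank candidates for a fixed query, but they are not interchangeable in general: cosine similarity ignores embedding norms, while negative Euclidean distance does not.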
For SPECTER 2.0, we use the most appropriate variant for each task. To be specific, following Singh et al. (2022), for coarse-grained classification, we use the classification adapter; for link prediction, we use the proximity adapter; for search, we use the adhoc query adapter to encode queries and the proximity adapter to encode candidate papers. For fine-grained classification, we test both the classification adapter and the proximity adapter; the latter achieves better performance on average, so we choose the proximity adapter.

C Hyperparameter Configurations of SciMult
For pre-training, we use the AdamW optimizer (Loshchilov and Hutter, 2019) with (β1, β2) = (0.9, 0.999) and warm up the learning rate for the first 5% of the steps. We train the model for 20 epochs with a peak learning rate of 3e-4 and a weight decay of 0.01.
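The learning-rate schedule can be sketched as follows. Note that the text above only specifies the warm-up fraction and the peak learning rate; the linear decay after warm-up is our own assumption for illustration.

```python
def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_frac=0.05):
    """Linear warm-up over the first 5% of steps, then an (assumed)
    linear decay from the peak learning rate down to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```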

D More Results on Fine-grained Classification
In this section, we show the fine-grained classification performance of the compared methods on MAPLE (Zhang et al., 2023b) without the label name matching heuristic. The results are in Table A3. We find that: (1) In comparison with the results in Table 3, after removing the label name matching heuristic, all models perform consistently worse. This observation validates the effectiveness of the heuristic proposed in Zhang et al. (2023b). (2) Most findings drawn from Table 3 still hold in Table A3. For example, all SciMult variants outperform all baselines in terms of the average metric, and SciMult-MHAExpert is always the best on cross-domain evaluation datasets.

E Analysis
In this section, we analyze some design choices of SciMult through controlled experiments.

E.1 Effect of Hard Negatives
We first demonstrate the contribution of hard negatives in our contrastive learning framework. Since the benefit of using hard negatives in citation prediction has been reported in Cohan et al. (2020) and Ostendorff et al. (2022), we mainly show how hard negative mining in fine-grained classification (proposed in subsection 3.4) improves the performance. Table A4 shows the average metrics of SciMult-Vanilla when trained with hard negatives of classification and without them (but still with hard negatives of the other tasks). We find that SciMult-Vanilla (with Hard Negative) outperforms SciMult-Vanilla (without Hard Negative) in most tasks, including not only classification but also link prediction and search. This observation indicates that the general quality of the learned text representations is enhanced after utilizing hard negatives of classification.
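One simple way to mine hard negatives for fine-grained classification is to pick incorrect labels that are lexically close to the paper, so the model must learn fine distinctions rather than separate obviously unrelated labels. The sketch below is our own illustration (function name and overlap scoring are assumptions, not necessarily the exact procedure of subsection 3.4):

```python
def mine_hard_negatives(paper_tokens, positive_labels, all_labels, k=2):
    """Return the k candidate labels with the highest token overlap with
    the paper that are NOT among its true labels (= hard negatives)."""
    paper_vocab = set(paper_tokens)

    def overlap(label):
        return len(set(label.split()) & paper_vocab)

    candidates = [lab for lab in all_labels if lab not in positive_labels]
    return sorted(candidates, key=overlap, reverse=True)[:k]
```

Random labels would mostly share no tokens with the paper and give near-zero contrastive signal; the top-overlap incorrect labels are the ones a model is most likely to confuse.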

E.2 Effect of the Pre-training Strategy
As mentioned in Appendix C, when pre-training non-Vanilla variants of SciMult, we follow a two-stage strategy: We first warm up the LM by training a Vanilla variant and then apply task-aware specialization or instruction tuning, hoping that the first stage learns common knowledge and the second stage focuses on task-specific skills. We now explore the effect of this strategy by comparing SciMult-MHAExpert with an ablation version: We skip the warm-up stage and directly use PubMedBERT as the initial checkpoint to start the second stage. Table A5 demonstrates the performance of SciMult-MHAExpert (with Warm-up) and SciMult-MHAExpert (without Warm-up). We find that the warm-up strategy has a positive contribution across all tasks.

E.3 Effect of Instructions
We now explore the effect of using different instructions during inference. As shown in Table 1, during pre-training, we use the following instruction for the link prediction task: "Find a pair of scientific papers that one paper cites the other." We call it the "Cite" instruction. Different from the pre-training data, in which links are always citation links, the evaluation datasets consist of other link types. Therefore, we consider the following instructions: "Find a pair of scientific papers that are co-viewed frequently." "Find a pair of scientific papers that

Figure 1: Two different types of Mixture-of-Experts Transformer architecture. They route the input to different MHA and FFN sub-layers, respectively, when considering different tasks.
(2) When pre-training non-Vanilla variants, warming up the LM by training a Vanilla variant during initial steps yields better performance. (3) Using some other reasonable instructions during inference does not significantly affect model performance. (4) Although label definitions are used for paper classification during pre-training and are beneficial to the classification performance during inference, our model can still outperform baselines in classification by taking label names as the only input.

Table 2: Datasets used for pre-training, in-domain evaluation, and cross-domain evaluation.

(Table 5), we also test

Table 6: Comparison between SciMult-MHAExpert and models on the PMC-Patients (Zhao et al., 2022) leaderboard for the Patient-to-Article Retrieval and Patient-to-Patient Retrieval tasks. SciMult-MHAExpert achieves the new state-of-the-art performance.

Table 9: Relative performance change of different SciMult variants in comparison with SciMult-Vanilla in terms of the average evaluation metric.

Table A1: Statistics of Pre-training Data.

Table A2: Statistics of Evaluation Datasets.

TREC-COVID. We directly use the original testing set. In the SciRepEval version, each query has multiple segments separated by [SEP].
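Splitting such a multi-segment query back into its parts can be done as follows (the helper name is ours, for illustration):

```python
def split_query_segments(query, sep="[SEP]"):
    """Split a SciRepEval TREC-COVID query string, whose segments are
    packed into one string separated by [SEP], into a list of segments."""
    return [seg.strip() for seg in query.split(sep) if seg.strip()]
```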

Table A3: Fine-grained classification performance on MAPLE (Zhang et al., 2023b) when the compared methods do not use the "label name matching" heuristic.

Table A4: Average metrics of SciMult-Vanilla with and without hard negatives in each evaluation task.

Table A5: Average metrics of SciMult-MHAExpert with different pre-training strategies in each evaluation task.