Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings

Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning and ignores that scientific papers can be very similar despite lacking a direct citation, which is a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) language models sample-efficiently and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain.


Introduction
Large pretrained language models (LLMs) achieve state-of-the-art results through fine-tuning on many NLP tasks (Rogers et al., 2020). However, the sentence or document embeddings derived from LLMs are of lesser quality compared to simple baselines like GloVe (Reimers and Gurevych, 2019), as their embedding space suffers from being anisotropic, i.e., poorly defined in some areas (Li et al., 2020).
One approach that has recently gained attention is the combination of LLMs with contrastive fine-tuning to improve the semantic textual similarity between document representations (Wu et al., 2020; Gao et al., 2021). These contrastive methods learn to distinguish between pairs of similar and dissimilar texts (positive and negative samples). As recent works show (Tian et al., 2020b; Rethmeier and Augenstein, 2022b,a; Shorten et al., 2021), the selection of these positive and negative samples is crucial for efficient contrastive learning.
This paper focuses on learning scientific document representations (SDRs). The core distinguishing feature of this domain is the presence of citation information that complements the textual information. The current state-of-the-art SPECTER by Cohan et al. (2020) uses citation information to generate positive and negative samples for contrastive fine-tuning of a SciBERT language model (Beltagy et al., 2019). SPECTER relies on 'citations by the query paper' as a discrete signal for similarity, i.e., positive samples are cited by the query while negative ones are not cited.
However, SPECTER's use of citations has its pitfalls. Considering only one citation direction may cause positive and negative samples to collide, since a paper pair could be treated as a positive and a negative instance simultaneously. Also, relying on a single citation as a discrete similarity signal is subject to noise, e.g., citations may reflect politeness and policy rather than semantic similarity (Pasternack, 1969), or related papers may lack a direct citation (Gipp and Beel, 2009). This discrete cut-off to similarity is counter-intuitive to (continuous) similarity-based learning.
Instead, the generation of non-colliding contrastive samples should be based on a continuous similarity function that allows us to find semantically similar papers, even without direct citations. With SciNCL, we address these issues by generating contrastive samples based on citation embeddings. The citation embeddings, which incorporate the full citation graph, provide a continuous, undirected, and less noisy similarity signal that allows the generation of arbitrarily difficult-to-learn positive and negative samples.

Contributions:
• We propose neighborhood contrastive learning for scientific document representations with citation graph embeddings (SciNCL) based on insights from contrastive learning theory.
• We sample positive (similar) and negative (dissimilar) papers from the k nearest neighbors in the citation graph embedding space, such that positives and negatives do not collide but are also hard to learn.
• We compare against the state-of-the-art approach SPECTER (Cohan et al., 2020) and other strong methods on the SCIDOCS benchmark and find that SciNCL outperforms SPECTER on average and on 9 of 12 metrics.
• Finally, we demonstrate that with SciNCL, using only 1% of the triplets for training, starting from a general-domain language model, or training only the bias terms of the model is sufficient to outperform the baselines.
• Our code and models are publicly available.

Related Work
Contrastive Learning pulls representations of similar data points (positives) closer together, while representations of dissimilar documents (negatives) are pushed apart. A common contrastive objective is the triplet loss (Schroff et al., 2015), which Cohan et al. (2020) used for scientific document representation learning, as we describe below. However, as Musgrave et al. (2020) and Rethmeier and Augenstein (2022b) point out, contrastive objectives work best when specific requirements are respected. (Req. 1) Views of the same data should introduce new information, i.e., the mutual information between views should be minimized (Tian et al., 2020b). We use citation graph embeddings to generate contrast label information that supplements text-based similarity. (Req. 2) For training time and sample efficiency, negative samples should be hard to classify, but should also not collide with positives (Saunshi et al., 2019). (Req. 3) Recent works like Musgrave et al. (2020) and Khosla et al. (2020) use multiple positives. However, positives need to be consistently close to each other (Wang and Isola, 2020), since positives and negatives may otherwise collide. E.g., Cohan et al. (2020) consider only 'citations by the query' as a similarity signal and not 'citations to the query'. Such unidirectional similarity does not rule out that a negative paper (not cited by the query) cites the query paper, and could thus cause collisions, the more we sample (Appendix F.10). Our method treats both citing and being cited as positives (Req. 2), while it also generates hard negatives and hard positives (Req. 2+3). Hard negatives are close to, but do not overlap, positives (red band in Fig. 1). Hard positives are close, but not trivially close, to the query document (green band in Fig. 1). The sampling-induced margin (space between the red and green bands in Fig. 1) ensures that contrast samples do not collide.
Triplet Mining remains a challenge in NLP due to the discrete nature of language, which makes data augmentation less trivial compared to computer vision (Gao et al., 2021). Examples of augmentation strategies are translation, word deletion, or word reordering (Fang et al., 2020; Wu et al., 2020). Positives and negatives can be sampled based on the sentence position within a document (Giorgi et al., 2021). Gao et al. (2021) utilize supervised entailment datasets for triplet generation. Language- and text-independent approaches are also applied: Kim et al. (2021) use an intermediate BERT hidden state for positive sampling, and Wu et al. (2021) add noise to representations to obtain negative samples. Xiong et al. (2020) present an approach similar to SciNCL in which they sample hard negatives from the k nearest neighbors in the embedding space derived from the previous model checkpoint. While Xiong et al. rely only on textual data, SciNCL also integrates citation information, which is especially valuable in the scientific context, as Cohan et al. (2020) have shown.
Aside from text, citations are a valuable signal for the similarity of research papers. Paper (node) representations can be learned using the citation graph (Wu et al., 2019; Perozzi et al., 2014; Grover and Leskovec, 2016). Especially for recommendations of papers or citations, hybrid combinations of text and citation features are often employed (Han et al., 2018; Jeong et al., 2020; Brochier et al., 2019; Yang et al., 2015; Holm et al., 2022).
Closest to SciNCL are Citeomatic (Bhagavatula et al., 2018) and SPECTER (Cohan et al., 2020). While Citeomatic relies on bag-of-words for its textual features, SPECTER is based on SciBERT. Both leverage citations to learn a triplet-based document embedding model, whereby positive samples are papers cited by the query. Easy negatives are random papers not cited by the query. Hard negatives are citations of citations: papers referenced in positive citations of the query, but not cited directly by it. Citeomatic also uses a second type of hard negatives, namely the nearest neighbors of a query that are not cited by it.
Unlike our approach, Citeomatic does not use the neighborhood of citation embeddings, but instead relies on the actual document embeddings from the previous epoch. Despite being related to SciNCL, the sampling approaches employed in Citeomatic and SPECTER do not account for the pitfalls of using discrete citations as a signal for paper similarity. Our work addresses this issue.
Cross-Modal Transfer. SciNCL transfers knowledge across modalities, i.e., from citations into a language model. In the terminology of Cohan et al. (2020), SciNCL can be considered a "citation-informed Transformer". Cross-modal transfer learning has been applied to various modality pairs (see Kaur et al. (2021) for an overview): text-to-image (Socher et al., 2013), RGB-to-depth image (Tian et al., 2020a), or graph-to-image (Wang et al., 2018). While the aforementioned methods incorporate cross-modal knowledge through joint loss functions or latent representations, SciNCL transfers knowledge through the contrastive sample selection, which we found superior to the direct transfer approach (Appendix F.9).

Methodology
Our goal is to learn citation-informed representations for scientific documents. To do so, we sample three document representation vectors and learn their similarity. For a given query paper vector d^Q, we sample a positive (similar) paper vector d^+ and a negative (dissimilar) paper vector d^-. This produces a 'query, positive, negative' triplet (d^Q, d^+, d^-), as visualized in Fig. 1. To learn paper similarity, we need to define three components: (§3.1) how to calculate document vectors d for the loss L over triplets; (§3.2) how citations provide similarity between papers; and (§3.3) how negative and positive papers (d^-, d^+) are sampled as (dis-)similar documents from the neighborhood of a query paper d^Q.

Contrastive Learning Objective
Given the textual content of a document d (paper), the goal is to derive a dense vector representation d that best encodes the document information and can be used in downstream tasks. A Transformer language model f (SciBERT; Beltagy et al. (2019)) encodes a document d into a vector representation f(d) = d. The input to the language model is the title and abstract separated by the [SEP] token. The final-layer hidden state of the [CLS] token is then used as the document representation. Training with a masked language modeling objective alone has been shown to produce suboptimal document representations (Li et al., 2020; Gao et al., 2021). Thus, similar to the SDR state-of-the-art method SPECTER (Cohan et al., 2020), we continue training the SciBERT model (Beltagy et al., 2019) using a self-supervised triplet margin loss (Schroff et al., 2015):

L = max( ||d^Q - d^+||_2 - ||d^Q - d^-||_2 + ξ, 0 )

Here, ξ is a slack term (ξ = 1 as in SPECTER) and ||·||_2 is the L2 norm, used as a distance function. However, the SPECTER sampling method has significant drawbacks. We describe these issues and our improvements, guided by contrastive learning theory, in detail below in §3.2.
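For illustration, the triplet margin loss above can be sketched in a few lines of NumPy (a toy sketch of the loss computation for a single triplet, not the actual PyTorch training code; the function name is ours):

```python
import numpy as np

def triplet_margin_loss(d_q, d_pos, d_neg, xi=1.0):
    """Triplet margin loss with L2 distance (Schroff et al., 2015).

    Pushes the query embedding d_q closer to the positive d_pos
    than to the negative d_neg, by at least the slack term xi.
    """
    dist_pos = np.linalg.norm(d_q - d_pos)  # ||d^Q - d^+||_2
    dist_neg = np.linalg.norm(d_q - d_neg)  # ||d^Q - d^-||_2
    return max(dist_pos - dist_neg + xi, 0.0)

# A well-separated triplet incurs zero loss:
q = np.array([0.0, 0.0])
pos = np.array([0.1, 0.0])  # close to the query
neg = np.array([5.0, 0.0])  # far from the query
loss = triplet_margin_loss(q, pos, neg)
```

Once the negative is closer to the query than the positive, the loss becomes positive and the gradient pushes the embeddings apart.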

Citation Neighborhood Sampling
Compared to the textual content of a paper, citations provide an outside view on a paper and its relation to the scientific literature (Elkiss et al., 2008), which is why citations are traditionally used as a similarity measure in library science (Kessler, 1963; Small, 1973). However, using citations as a discrete similarity signal, as done in Cohan et al. (2020), has its pitfalls. Their method defines papers cited by the query as positives, while papers citing the query could be treated as negatives. This means that positive and negative learning information collides between citation directions, which Saunshi et al. (2019) have shown to deteriorate performance. Furthermore, a cited paper can have a low similarity with the citing paper, given the many motivations a citation can have (Teufel et al., 2006). Likewise, a similar paper might not be cited.
To overcome these limitations, we first learn citation embeddings and then use the citation neighborhood around a given query paper d^Q to construct similar (positive) and dissimilar (negative) samples for contrast, using the k nearest neighbors. This builds on the intuition that nodes connected by edges should be close to each other in the embedding space (Perozzi et al., 2014). Using citation embeddings allows us to: (1) sample paper similarity on a continuous scale, which makes it possible to (2) define hard-to-learn positives, as well as (3) hard- or easy-to-learn negatives. Points (2-3) are important in making contrastive learning efficient, as we describe below in §3.3.

Positives and Negatives Sampling
Positive samples: d^+ should be semantically similar to the query paper d^Q, i.e., sampled close to the query embedding d^Q. Additionally, as Wang and Isola (2020) find, positives should be sampled from comparable locations (distances from the query) in embedding space and be dissimilar enough from the query embedding to avoid gradient collapse (zero gradients). Therefore, we sample c^+ positive (similar) papers from a close neighborhood around the query embedding d^Q, i.e., the green band in Fig. 1. When sampling with KNN search, we use a small k^+ to find positives and later analyze the impact of k^+ in Fig. 2.
Negative samples: can be divided into easy and hard negative samples (light and dark red in Fig. 1). Sampling more hard negatives is known to improve contrastive learning (Bucher et al., 2016; Wu et al., 2017). However, we make sure to sample hard negatives (red band in Fig. 1) such that they are close to potential positives but do not collide with them (green band), by using a tunable 'sampling-induced margin'. We do so because Saunshi et al. (2019) showed that sampling a larger number of hard negatives only improves performance if the negatives do not collide with positive samples, since collisions make the learning signal noisy. That is, in the margin between hard negatives and positives we expect positives and negatives to collide, so we avoid sampling from this region. To generate a diverse self-supervised citation similarity signal for contrastive SDR learning, we also sample easy negatives that are farther from the query than hard negatives. For negatives, k^- should be large when sampling via KNN to ensure samples are dissimilar from the query paper.
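The banded neighborhood sampling described above can be sketched as follows (a simplified sketch, not the authors' implementation; the function name and the exact banding rule, i.e., taking positives uniformly from the k^+ nearest neighbors and hard negatives from the ranks just inside k^-_hard, are illustrative assumptions):

```python
import numpy as np

def sample_from_neighborhood(ranked_ids, k_pos, k_hard, c_pos, c_hard, rng):
    """Sample positives and hard negatives from a KNN ranking.

    ranked_ids: neighbor ids of the query, sorted by increasing distance
    in the citation embedding space (query itself excluded).
    Positives come from the k_pos nearest neighbors (green band in Fig. 1);
    hard negatives sit just inside rank k_hard (red band), leaving a
    sampling-induced margin between the two bands.
    """
    ranked_ids = np.asarray(ranked_ids)
    positives = rng.choice(ranked_ids[:k_pos], size=c_pos, replace=False)
    # take the c_hard neighbors right before rank k_hard as hard negatives
    hard_negatives = ranked_ids[k_hard - c_hard:k_hard]
    return positives, hard_negatives

rng = np.random.default_rng(0)
ranked = list(range(100))  # toy ranking: ids 0..99 by distance
pos, hard = sample_from_neighborhood(ranked, k_pos=10, k_hard=50,
                                     c_pos=3, c_hard=2, rng=rng)
```

With k_pos=10 and k_hard=50, ranks 10-47 form the margin from which nothing is sampled, so positives and hard negatives cannot collide.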

Sampling Strategies
As described in §3.2 and §3.3, our approach improves upon the method by Cohan et al. (2020). Therefore, we reuse their sampling parameters (5 triplets per query paper) and then further optimize our method's hyperparameters. For example, to train the triplet loss, we generate the same number of (d^Q, d^+, d^-) triplets per query paper as SPECTER (Cohan et al., 2020). To be precise, this means we generate c^+=5 positives (as explained in §3.3). We also generate 5 negatives: three easy negatives (c^-_easy=3) and two hard negatives (c^-_hard=2), as described in §3.3.
Below, we describe three strategies (I-III) for sampling triplets. These either sample neighboring papers from citation embeddings (I), by random sampling (II), or using both strategies (III). For each strategy, let c be the number of samples for either positives c^+, easy negatives c^-_easy, or hard negatives c^-_hard.
Citation Graph Embeddings: We train a graph embedding model f_c on citations extracted from the Semantic Scholar Open Research Corpus (S2ORC; Lo et al., 2020) to obtain citation embeddings C. We utilize PyTorch BigGraph (Lerer et al., 2019), which allows training on large graphs with modest hardware requirements. The resulting graph embeddings perform well using the default training settings from Lerer et al. (2019); given more computational resources, careful tuning may produce even better-performing embeddings. Nonetheless, we conducted a narrow parameter search based on link prediction (see Appendix D).
(I) KNN sampling: Sample c papers from the k nearest neighbors of the query in the citation embedding space (§3.3).
(II) Random sampling: Sample any c papers without replacement from the corpus.
(III) Filtered random: Like (II), but excluding the papers retrieved by KNN, i.e., all neighbors within the largest k are excluded. This is analogous to SPECTER's approach of selecting random candidates that are not cited by the query.
The KNN sampling introduces the hyperparameter k, which allows for the controlled sampling of positives or negatives of different difficulty (from easy to hard, depending on k). Specifically, in Fig. 1 the hyperparameter k defines the tunable sampling-induced margin between positives and negatives, as well as the width and position of the positive sample band (green) and negative sample band (red) around the query sample. Besides the strategies above, we experiment with similarity thresholds, k-means clustering, and sorted random sampling, none of which performs well (Appendix F).
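Strategy (III) amounts to random sampling over the corpus with the KNN results blacklisted. A minimal sketch (the function name is ours; in practice the corpus holds millions of paper ids, so the exclusion rarely changes the outcome, which matches the marginal differences reported in §6.3):

```python
import random

def filtered_random_negatives(corpus_ids, knn_ids, c, rng):
    """Strategy (III): random easy negatives, excluding all KNN results.

    Excluding the neighborhood within the largest k guarantees that easy
    negatives cannot fall into the positive or hard-negative bands.
    """
    excluded = set(knn_ids)
    candidates = [pid for pid in corpus_ids if pid not in excluded]
    return rng.sample(candidates, c)

rng = random.Random(0)
# toy corpus of 20 papers; ids 0..9 are the query's KNN results
negs = filtered_random_negatives(list(range(20)), list(range(10)), c=3, rng=rng)
```

Every sampled easy negative is guaranteed to lie outside the query's KNN neighborhood.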

Experiments
In the following, we introduce our experiments, including the datasets and implementation details.

Evaluation Dataset
We evaluate on the SCIDOCS benchmark (Cohan et al., 2020). A key difference to other benchmarks is that embeddings are the input to the individual tasks without explicit fine-tuning. The SCIDOCS benchmark consists of the following four tasks: document classification (CLS) with Medical Subject Headings (MeSH) (Lipscomb, 2000) and Microsoft Academic Graph (MAG) labels (Sinha et al., 2015); co-view and co-read (USR) prediction based on the L2 distance between embeddings; direct and co-citation (CITE) prediction based on the L2 distance between embeddings; and recommendation (REC) generation based on embeddings and paper metadata.

Training Datasets
The experiments mainly compare SciNCL against SPECTER on the SCIDOCS benchmark. However, we found 40.5% of SCIDOCS's papers leaking into SPECTER's training data (the leakage affects only the unsupervised paper data but not the gold labels; see Appendix B). To be transparent about this leakage, we train SciNCL on two datasets:

SPECTER replication (w/ leakage): We replicate SPECTER's training data including its leakage. Unfortunately, SPECTER provides neither citation data nor a mapping to S2ORC, on which our citation embeddings are based. We successfully map 96.2% of SPECTER's query papers and 83.3% of the corpus from which positives and negatives are sampled to S2ORC. To account for the missing papers, we randomly sample papers from S2ORC (without the SCIDOCS papers) such that the absolute number of papers is identical to SPECTER.

S2ORC subset (w/o leakage):
We select a random subset from S2ORC that does not contain any of the mapped SCIDOCS papers. This avoids SPECTER's leakage, but also makes the scores reported in Cohan et al. (2020) less comparable. We successfully map 98.6% of the SCIDOCS papers to S2ORC. Thus, only the remaining 1.4% of the SCIDOCS papers could leak into this training set.
The details of the dataset creation are described in Appendix A and C. Both training sets yield 684K triplets (the same count as SPECTER). Also, the ratio of training triplets per query remains the same (§3.4). Our citation embedding model is trained on the S2ORC citation graph. In w/ leakage, we include all SPECTER papers even if they are part of SCIDOCS; the remaining SCIDOCS papers are excluded (52.5M nodes and 463M edges). In w/o leakage, all mapped SCIDOCS papers are excluded (52.4M nodes and 447M edges) such that we avoid leakage also for the citation embedding model.

Model Training and Implementation
We replicate the training setup from SPECTER as closely as possible. We implement SciNCL using Huggingface Transformers (Wolf et al., 2020), initialize the model with SciBERT's weights (Beltagy et al., 2019), and train via the triplet loss (§3.1). The optimizer is Adam with weight decay (Kingma and Ba, 2015; Loshchilov and Hutter, 2019) with a learning rate of λ=2e-5. To explore the effect of compute-efficient fine-tuning, we also train a BitFit model (Ben Zaken et al., 2022) with λ=1e-4 (§7.2). We train SciNCL on two NVIDIA GeForce RTX 6000 GPUs (24GB) for 2 epochs (approx. 24 hours of training time) with batch size 8 and gradient accumulation for an effective batch size of 32 (same as SPECTER). The graph embedding training is performed on an Intel Xeon Gold 6230 CPU with 60 cores and takes approx. 6 hours. The KNN strategy is implemented with FAISS (Johnson et al., 2021) using a flat index (exhaustive search) and takes less than 30 min for indexing and retrieval of the triplets.
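A flat FAISS index performs exhaustive search, i.e., it compares each query against every corpus vector. The computation it carries out can be sketched in plain NumPy (a toy equivalent for illustration, not the FAISS library itself; the function name is ours):

```python
import numpy as np

def exhaustive_knn(queries, corpus, k):
    """Exhaustive (flat-index) KNN over L2 distances.

    Mirrors what a flat index computes: for each query vector,
    the indices of the k corpus vectors with the smallest L2 distance,
    sorted from nearest to farthest.
    """
    # squared L2 distances, shape (n_queries, n_corpus)
    dists = ((queries[:, None, :] - corpus[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

corpus = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
queries = np.array([[0.9, 0.1]])  # closest to corpus vector 1, then 0
idx = exhaustive_knn(queries, corpus, k=2)
```

Exhaustive search is exact but scales linearly with corpus size; for the S2ORC-scale corpus used here it still completes within the reported 30 minutes.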
Also, we compare against Oracle SciDocs, which is identical to SciNCL except that its triplets are generated from SCIDOCS's validation and test sets using their gold labels. For example, papers with the same MAG labels are positives and papers with different labels are negatives. Similarly, the ground truth of the other tasks is used, i.e., clicked recommendations are considered positives, etc. In total, this procedure creates 106K training triplets for Oracle SciDocs. Moreover, we under-sample triplets from the classification tasks to ensure a balanced triplet distribution over the tasks. Accordingly, Oracle SciDocs represents an estimate of the performance upper bound that can be achieved with the current setting (triplet margin loss and SciBERT encoder).

Overall Results
Tab. 1 shows the results, comparing the SciNCL configuration with the best validation performance against the baselines. With replicated SPECTER training data (w/ leakage), SciNCL achieves an average performance of 81.8 across all metrics, a 1.8-point absolute improvement over SPECTER (the next-best baseline). When trained without leakage, the improvement of SciNCL over SPECTER remains consistent at 1.7 points, although scores are generally lower (79.4 avg. score). In the following, we refer to the results obtained through training on the replicated SPECTER data (w/ leakage) if not otherwise mentioned.
We find the best validation performance based on SPECTER's data when positives and hard negatives are sampled with KNN, with k^+=25 for positives and k^-_hard=4000 for hard negatives (§6). Easy negatives are generated through filtered random sampling. SciNCL's scores are reported as the mean over ten random seeds (seed ∈ [0, 9]).
For MAG classification, SPECTER achieves the best result with 82.0 F1, followed by SciNCL with 81.4 F1 (-0.6 points). For MeSH classification, SciNCL yields the highest score with 88.7 F1 (+2.3 compared to SPECTER). Both classification tasks have in common that the chosen training settings lead to over-fitting: when changing the training to use only 1% of the training data, SciNCL yields 82.2 F1@MAG (Tab. 2). In all user activity and citation tasks, SciNCL yields higher scores than all baselines. Moreover, SciNCL outperforms SGC on direct citation prediction, where SGC outperforms SPECTER in terms of nDCG. On the recommender task, SPECTER yields the best P@1 with 20.0, whereas SciNCL achieves 19.3 P@1 (in terms of nDCG, SciNCL and SPECTER are on par).
When training SPECTER and SciNCL without leakage, SciNCL outperforms SPECTER on 11 of 12 metrics and is on par on the remaining one. This suggests that SciNCL's hyperparameters have a low corpus dependency, since they were only optimized on the corpus with leakage.
Regarding the LLM baselines, we observe that the general-domain BERT, with a score of 63.4, outperforms the domain-specific BERT variants, namely SciBERT (59.6) and BioBERT (58.8). This underlines the anisotropy problem of embeddings directly extracted from current LLMs and highlights the advantage of combining text and citation information.
In summary, we show that SciNCL's triplet selection leads on average to a performance improvement on SCIDOCS, with most gains observed for user activity and citation tasks. The gain from 80.0 to 81.8 is particularly notable given that even Oracle SciDocs yields an only marginally higher average score of 83.0, despite using test and validation data from SCIDOCS for the triplet selection. Appendix H shows examples of paper triplets.

Impact of Sample Difficulty
In this section, we present the optimization of SciNCL's sampling strategy (§3.3). We optimize the sampling for positives and hard or easy negatives with a partial grid search on a random sample of 10% of the replicated SPECTER training data (sampled based on queries). Our experiments show that optimizations on this subset correlate with the entire dataset. The validation scores in Fig. 2 and 3 are reported as the mean over three random seeds.

Positive Samples
Fig. 2 shows the average scores on the SCIDOCS validation set depending on the selection of positives with the KNN strategy. We only change k^+, while negative sampling remains fixed to its best setting (§6.2). The performance is relatively stable for k^+<100, with a peak at k^+=25; for k^+>100, the performance declines as k^+ increases. This is in line with Wang and Isola (2020), who find that positives should not be trivially similar to the query. For example, at k^+=5, positives may be a bit "too easy" to learn, such that they produce less informative gradients than the optimal setting k^+=25. Similarly, making k^+ too large leads to the sampling-induced margin being too small, such that positives collide with negative samples, which creates contrastive label noise that degrades performance (Saunshi et al., 2019).
Another observation concerns the standard deviation σ: one would expect σ to be independent of k^+, since random seeds affect only the negatives. However, positives and negatives interact with each other through the triplet margin loss; therefore, σ is also affected by k^+. To account for the interaction of positives and negatives, one could sample simultaneously based on the distance to the query and the distance of positives and negatives to each other.

Hard Negative Samples
Fig. 3 presents the validation results for different k^-_hard given the best setting for positives (k^+=25). The performance increases with increasing k^-_hard until a plateau between 2000<k^-_hard<4000, with a peak at k^-_hard=4000. This plateau can also be observed in the test set, where k^-_hard=3000 yields a marginally lower score of 81.7 (Tab. 2). For k^-_hard>4000, the performance starts to decline again. This suggests that for large k^-_hard the samples are not "hard enough", which confirms the findings of Cohan et al. (2020).

Easy Negative Samples
Filtered random sampling of easy negatives yields the best validation performance compared to pure random sampling (Tab. 2). However, the performance difference is marginal; when rounded to one decimal, their average test scores are identical. The marginal difference is caused by the large corpus size and the resulting small probability of randomly sampling a paper from the KNN results. But without filtering, the effect of random seeds increases, as we find a higher standard deviation compared to the setting with filtering.
As a potential way to decrease randomness, we experiment with other approaches like k-means clustering but find that they decrease the performance (Appendix F).

Collisions
Similar to SPECTER, SciNCL's sampling based on graph embeddings could cause collisions when selecting positives and negatives from regions close to each other. To avoid this, we rely on a sampling-induced margin that is defined by the hyperparameters k^+ and k^-_hard (distance between the red and green bands in Fig. 1). When the margin gets too small, positives and negatives are more likely to collide. A collision occurs when a paper pair (d^Q, d^S) is contained in the training data as a positive and as a negative sample at the same time. Fig. 4 demonstrates the relation between the number of collisions and the size of the sampling-induced margin: the number of collisions increases when the margin gets smaller. The opposite is the case when the margin is large enough (k^-_hard > 1000), i.e., the number of collisions goes to zero. This relation also affects the evaluation performance, as Fig. 2 and Fig. 3 show. Namely, for large k^+ or small k^-_hard, SciNCL's performance declines and approaches SPECTER's performance.
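Counting collisions as defined above reduces to a set intersection over the generated triplets (a minimal sketch; the function name and triplet encoding as (query, positive, negative) id tuples are our assumptions):

```python
def count_collisions(triplets):
    """Count (query, sample) pairs that occur both as a positive
    and as a negative somewhere in the training triplets.

    triplets: iterable of (query_id, positive_id, negative_id) tuples.
    """
    pos_pairs = {(q, p) for q, p, _ in triplets}
    neg_pairs = {(q, n) for q, _, n in triplets}
    return len(pos_pairs & neg_pairs)

# "C" is a negative for query "A" in the first triplet
# but a positive for "A" in the second -> one collision
triplets = [("A", "B", "C"), ("A", "C", "D")]
n_collisions = count_collisions(triplets)
```

With a sufficiently large sampling-induced margin, this count drops to zero, which is the regime Fig. 4 identifies as k^-_hard > 1000.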

Ablation Analysis
Next, we evaluate the impact of the language model initialization and of the number of parameters and triplets.

Initial Language Models
Tab. 2 shows the effect of initializing the model weights not with SciBERT but with general-domain LLMs (BERT-Base and BERT-Large) or with BioBERT. Initialization with other LLMs decreases the performance. However, the decline is marginal (BERT-Base -0.6, BERT-Large -0.4, BioBERT -0.4), and all LLMs outperform the SPECTER baseline. For the recommendation task, in which SPECTER is superior to SciNCL, BioBERT outperforms SPECTER. This indicates that the improved triplet mining of SciNCL has a greater domain adaptation effect than pretraining on domain-specific literature. Given that pretraining of LLMs requires an order of magnitude more resources than fine-tuning with SciNCL, our approach can be a solution for resource-limited use cases.

Data and Computing Efficiency
The last three rows of Tab. 2 show the results regarding data and computing efficiency. When keeping the citation graph unchanged but training the language model with only 10% of the original triplets, SciNCL still yields a score of 81.1 (-0.6). Even with only 1% (6,840 triplets), SciNCL achieves a score of 80.8, which is 1.0 points less than with 100% but still 0.8 points more than the SPECTER baseline. With this textual sample efficiency, one could manually create triplets or use existing supervised datasets as in Gao et al. (2021). Lastly, we evaluate BitFit training (Ben Zaken et al., 2022), which only trains the bias terms of the model while freezing all other parameters. This corresponds to training only 0.1% of the original parameters. With BitFit, SciNCL yields a considerable score of 81.2 (-0.5 points). As a result, SciNCL could be trained on the same hardware with even larger (general-domain) language models (§7.1).

Conclusion
We present a novel approach for contrastive learning of scientific document embeddings that addresses the challenge of selecting informative positive and negative samples. By leveraging citation graph embeddings for sample generation, SciNCL achieves a score of 81.8 on the SCIDOCS benchmark, a 1.8-point improvement over the previous best method SPECTER. This is achieved purely by introducing tunable sample difficulty and avoiding collisions between positive and negative samples, while existing LLM and data setups can be reused. The improvement over SPECTER can also be observed when excluding the SCIDOCS papers during training (see w/o leakage in Tab. 1). Furthermore, SciNCL's improvement from 80.0 to 81.8 is particularly notable given that even oracle triplets, which are generated with SCIDOCS's test and validation data, yield an only marginally higher score of 83.0.
Our work highlights the importance of sample generation in a contrastive learning setting. We show that language model training with 1% of the triplets is sufficient to outperform SPECTER, whereas the remaining 99% provide only 1.0 additional points (80.8 to 81.8). This sample efficiency is achieved by adding reasonable effort for sample generation, i.e., graph embedding training and KNN search. We also demonstrate that in-domain LLM pretraining (like SciBERT) is beneficial, while general-domain LLMs can achieve comparable performance and even outperform SPECTER. This indicates that controlling sample difficulty and avoiding collisions is more effective than in-domain pretraining, especially in scenarios where training an LLM from scratch is infeasible.

Limitations
SciNCL's strategy of selecting positive and negative samples requires additional computational resources for training the graph embedding model, performing the KNN search, and optimizing the hyperparameters k+ and k−hard (§4.3). While some of the compute is offset by the sample-efficient language model training (§7.2), we still consider the increased compute effort the major limitation of the SciNCL method.
Especially the training of the graph embedding model accounts for most of the additional compute effort. This is also why we provide only a shallow evaluation of the graph embeddings (Appendix D). For example, we did not evaluate the effect of different graph embeddings on the actual SCIDOCS performance. Moreover, evaluations with smaller subsets of the S2ORC citation graph are missing. Such evaluations could indicate whether less citation data would suffice, which would lower the compute requirements and make SciNCL applicable in domains where less graph data is available.
A Mapping to S2ORC
Neither the SPECTER training data nor the SciDocs test data comes with a mapping to the S2ORC dataset, which we use for training the citation embedding model. However, such a mapping is needed to replicate SPECTER's training data and to avoid leakage of SciDocs test data. Therefore, we map the papers to S2ORC based on PDF hashes and exact title matches. The remaining paper metadata is collected through the Semantic Scholar API. Tab. 3 summarizes the outcome of the mapping procedure. Failed mappings can be attributed to papers being unavailable through the Semantic Scholar API (e.g., retracted papers) or papers not being part of the S2ORC citation graph.

B SPECTER-SciDocs Leakage
When replicating SPECTER (Cohan et al., 2020), we found a substantial overlap between the papers used during model training and the papers from their SCIDOCS benchmark. In both datasets, papers are associated with Semantic Scholar IDs. Thus, no custom ID mapping as in Appendix A is required to identify papers that leak from training to test data. Of the 311,860 unique papers in SPECTER's training data, we find 79,201 papers (25.4%) in the test set of SCIDOCS and 79,609 papers (25.5%) in its validation set. When combining test and validation set, there is a total overlap of 126,176 papers (40.5%). However, this overlap affects only the 'unsupervised' paper metadata (title, abstract, citations, etc.) and not the gold labels used in SCIDOCS (e.g., MAG labels or clicked recommendations).
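Since papers in both datasets carry Semantic Scholar IDs, the overlap check reduces to plain set intersections; a toy sketch with hypothetical IDs (not the real paper sets):

```python
# Toy Semantic Scholar IDs standing in for the real paper sets (hypothetical).
train_ids = {"s1", "s2", "s3", "s4", "s5"}
scidocs_test = {"s2", "s3", "x1"}
scidocs_val = {"s3", "s4", "x2"}

test_leak = train_ids & scidocs_test                      # leaks into the test set
combined_leak = train_ids & (scidocs_test | scidocs_val)  # leaks into test or val
leak_ratio = len(combined_leak) / len(train_ids)
```

Applied to the real ID sets, this is exactly the computation behind the 25.4%, 25.5%, and 40.5% figures above.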

C Dataset Creation
As described in §4.2, we conduct our experiments on two datasets. Both rely on the citation graph of S2ORC (Lo et al., 2020); more specifically, S2ORC with the version identifier 20200705v1 is used. The full citation graph consists of 52.6M nodes (papers) and 467M edges (citations). Tab. 4 presents statistics on the datasets and their overlap with SPECTER and SCIDOCS. The steps to reproduce both datasets are: Replicated SPECTER (w/ leakage) To replicate SPECTER's training data without increasing the leakage, we exclude all SCIDOCS papers that are not used by SPECTER from the S2ORC citation graph. This means that, apart from the 110,538 SPECTER papers, not a single other SCIDOCS paper is included. The resulting citation graph has 52.5M nodes and 463M edges and is used for training the citation graph embeddings.
For the SciNCL triplet selection, we also replicate SPECTER's query papers and the corpus from which positives and negatives are sampled. Our mapping and the underlying citation graph allow us to use 227,869 of SPECTER's 248,007 papers for training. Regarding query papers, we use 131,644 of SPECTER's 136,820 query papers. To align the number of training triplets with SPECTER's, additional papers are randomly sampled from the filtered citation graph.
Random S2ORC subset (w/o leakage) To avoid leakage, we exclude all successfully mapped SCIDOCS papers from the S2ORC citation graph. After filtering, the graph has 52.3M nodes and 447M edges. The citation graph embedding model is trained on this graph.
Next, we reproduce the triplet selection from SPECTER. 136,820 random query papers are selected from the filtered graph. For each query, we generate five positives (cited by the query), two hard negatives (citations of citations), and three random nodes from the filtered S2ORC citation graph. This sampling produces 684,100 training triplets with 680,967 unique paper IDs (more compared to the replicated SPECTER dataset). Based on these triplets, the SPECTER model for this dataset is trained with the same model settings and hyperparameters as SciNCL (second to last row in Tab. 1).
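The sampling scheme above can be sketched on a toy citation graph (the helper function and the toy graph are our own illustration, not the original SPECTER code):

```python
import random

def specter_style_triplets(citations, query, n_pos=5, n_easy=3, seed=0):
    """Sketch of SPECTER-style sampling: positives are papers cited by the
    query, hard negatives are citations of citations (not directly cited),
    and easy negatives are random papers from the remaining corpus."""
    rng = random.Random(seed)
    pos = set(citations.get(query, []))
    hard = {c2 for c in pos for c2 in citations.get(c, [])} - pos - {query}
    corpus = set(citations) | {p for cited in citations.values() for p in cited}
    easy_pool = sorted(corpus - pos - hard - {query})
    positives = rng.sample(sorted(pos), min(n_pos, len(pos)))
    easy = rng.sample(easy_pool, min(n_easy, len(easy_pool)))
    return positives, sorted(hard), easy

# Toy citation graph: paper -> papers it cites (hypothetical IDs).
cites = {"q": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": [], "e": []}
pos, hard, easy = specter_style_triplets(cites, "q")
```

On the real graph, the same query is paired once with each positive against one negative, yielding the five-positive/five-negative triplet counts described above.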
Lastly, the SciNCL triplets are generated based on the citation graph embeddings of the same 680,967 unique paper IDs, i.e., the FAISS index contains only these papers and not the remaining S2ORC papers. Also, the same 136,820 query papers are used.

D Graph Embedding Evaluation
To evaluate the underlying citation graph embeddings, we experiment with a few of BigGraph's hyperparameters. We trained embeddings with different dimensions d={128, 512, 768} and different distance measures (cosine similarity and dot product) on 99% of the graph's edges and evaluated on the remaining 1% with the link prediction task. An evaluation of the graph embeddings with SCIDOCS is not possible since we could not map the papers used in SCIDOCS to the S2ORC corpus. All variations are trained for 20 epochs with margin m=0.15 and learning rate λ=0.1 (based on the settings recommended by Lerer et al. (2019)). Tab. 5 shows the link prediction performance measured in MRR, Hits@1, Hits@10, and AUC. Dot product is substantially better than cosine similarity as a distance measure. Performance also correlates positively with embedding size: the larger the embeddings, the better the link prediction performance. Graph embeddings with d=768 were the largest possible size given our compute resources (available disk space was the limiting factor).
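The reported ranking metrics can be computed from the 1-based rank of each true edge among the scored candidates; a minimal pure-Python sketch (our own helper, not BigGraph's evaluator):

```python
def link_prediction_metrics(ranks, ks=(1, 10)):
    """MRR and Hits@k from the 1-based ranks of the true edges among
    all scored candidate edges."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

# Hypothetical ranks of four held-out edges.
mrr, hits = link_prediction_metrics([1, 2, 10, 50])
```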

E Baseline Details
Unless mentioned otherwise, all BERT variations are used in their base-uncased versions.
The weights for BERT (bert-base-uncased), BioBERT (biobert-base-cased-v1.2), CiteBERT (citebert), and DeCLUTR (declutr-sci-base) are taken from the Huggingface Hub. We use the Universal Sentence Encoder (USE) from Tensorflow Hub. For Oracle SciDocs, we use the SciNCL implementation and under-sample the triplets from the classification tasks to ensure a balanced triplet distribution over the tasks. The SPECTER version for the random S2ORC training data (w/o leakage) is also trained with the SciNCL implementation. Please see Cohan et al. (2020) for additional baseline methods and their implementation details.

F Negative Results
We investigated additional sampling strategies and model modifications, none of which led to a significant performance improvement.

F.1 Undirected Citations
Our graph embedding model considers citations as directed edges by default. We also train a SciNCL model with undirected citations by converting each edge (a, b) into the two edges (a, b) and (b, a). This approach yields slightly worse performance (81.7 avg. score; -0.1 points) and was therefore discarded for the final experiments.
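The edge conversion can be sketched as (our own helper over a toy edge list):

```python
def to_undirected(edges):
    """Add the reverse of every citation edge, deduplicating the result."""
    return sorted(set(edges) | {(b, a) for a, b in edges})

# Toy directed citation edges (hypothetical paper IDs).
edges = [("p1", "p2"), ("p2", "p3")]
undirected = to_undirected(edges)
```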

F.2 KNN with an Interval Larger than c
Our best results are achieved with KNN where the size of the neighbor interval (k − c; k] equals the number of samples c that the strategy should generate. In addition, we also experimented with larger intervals, e.g., (1000; 2000], from which c papers are randomly sampled. This approach yields comparable results but suffers from a larger effect of randomness and is therefore more difficult to optimize.

F.3 K-Means Cluster for Easy Negatives
Easy negatives are supposed to be far away from the query. Random sampling from a large corpus ensures this, as our results show. As an alternative approach, we tried k-means clustering, whereby we selected easy negatives from a cluster whose centroid has a given distance to the query's centroid. However, this decreased the performance.

F.4 Sampling with Similarity Threshold
As an alternative to KNN, we select samples based on cosine similarity in the citation embedding space: we take the c papers that are within the similarity threshold t of a query paper d_Q, i.e., papers d_i with s(f_c(d_Q), f_c(d_i)) > t, where s is the cosine similarity function.
For example, given the similarity scores S={0.9, 0.8, 0.7, 0.1} (in descending order; the higher the similarity, the closer the candidate embedding is to the query embedding) with c=2 and t=0.5, the two candidates with scores 0.8 and 0.7 would be selected as samples. While the positive threshold t+ should be close to 1, the negative threshold t− should be small to ensure samples are dissimilar from d_Q. However, the empirical results suggest that this strategy is inferior to KNN. Selecting hard negatives based on the similarity threshold yields a test score of 81.7 (-0.1 points). Fig. 5 shows the validation results for different similarity thresholds. A similar pattern as in Fig. 3 can be seen: when the negatives are closer to the query paper (larger similarity threshold t), the validation score decreases.
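One reading of this strategy, consistent with the worked example above, keeps the least similar candidates that still clear the threshold; a sketch (our own helper, not the paper's code):

```python
import numpy as np

def sample_by_threshold(sims, c, t):
    """Keep the c candidates with the smallest cosine similarity that
    still exceeds threshold t (one reading of the SIM strategy)."""
    order = np.argsort(sims)                 # ascending similarity
    above = [int(i) for i in order if sims[i] > t]
    return above[:c]

sims = np.array([0.9, 0.8, 0.7, 0.1])
picked = sample_by_threshold(sims, c=2, t=0.5)  # indices of scores 0.7 and 0.8
```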

F.6 Positives with Similarity Threshold
Positive sampling with SIM performs poorly since, even for small t+ < 0.5, many query papers (more than 40%) do not have any neighbors within this similarity threshold. Solving this issue would require changing the set of query papers, which we omit for comparability with SPECTER.

F.7 Sorted Random
Simple random sampling does not control whether a sample is far from or close to the query. To integrate a distance measure into random sampling, we first sample n candidates, then order the candidates according to their distance to the query, and lastly select as samples the c candidates that are closest to or furthest from the query.
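A sketch of this sorted-random strategy (the helper name and the Euclidean distance choice are our assumptions):

```python
import random
import numpy as np

def sorted_random_sample(embeddings, query_vec, n, c, closest=True, seed=0):
    """Sample n random candidates, sort them by Euclidean distance to the
    query, and keep the c closest (or furthest) ones as samples."""
    rng = random.Random(seed)
    candidates = rng.sample(range(len(embeddings)), n)
    dists = [float(np.linalg.norm(embeddings[i] - query_vec)) for i in candidates]
    ordered = [i for _, i in sorted(zip(dists, candidates))]
    return ordered[:c] if closest else ordered[-c:]

embeddings = np.arange(10, dtype=float).reshape(10, 1)  # toy 1-d embeddings
query = np.array([0.0])
close = sorted_random_sample(embeddings, query, n=10, c=3, closest=True)
far = sorted_random_sample(embeddings, query, n=10, c=3, closest=False)
```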
F.8 Masked Language Modeling
Giorgi et al. (2021) show that combining a contrastive loss with a masked language modeling loss can improve text representation learning. However, in our experiments a combined loss decreases the performance on SCIDOCS, probably due to the effects found by Li et al. (2020).

F.9 Student-Teacher Learning
Student-teacher learning is effective in related work on cross-modal knowledge transfer (Kaur et al., 2021; Tian et al., 2020a). We also try to adopt this approach for our experiments, whereby the Transformer language model is the student and the citation graph embedding model is the teacher. By learning directly from the citation embeddings, we could circumvent the positive and negative sampling needed for triplet loss learning, which introduces unwanted issues like collisions. Given a batch of document representations derived from text D_Text (through the language model) and the citation graph representations for the same documents D_Graph, we compute the pairwise cosine similarities for both sets, S_Text and S_Graph. To transfer the knowledge from the citation embeddings into the language model, we devise the student-teacher loss L_ST based on a mean-squared-error (MSE) loss such that the difference between the cosine similarities is minimized: L_ST = MSE(S_Text, S_Graph). Despite the promising results from Tian et al. (2020a), the student-teacher approach performs poorly in our experiments. We attribute this to overfitting on the citation data (the training loss approaches zero after a few steps while the validation loss remains high). The model trained with L_ST yields only a SCIDOCS average score of 64.7, slightly better than SciBERT but substantially worse than SciNCL with triplet loss.
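A minimal numpy sketch of this loss, assuming row-wise embedding matrices (the helper names are ours, not the paper's implementation):

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarities between the row vectors of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def student_teacher_loss(text_emb, graph_emb):
    """L_ST: mean squared error between the pairwise cosine-similarity
    matrices of text and citation-graph embeddings."""
    return float(np.mean((cosine_matrix(text_emb) - cosine_matrix(graph_emb)) ** 2))

# Identical geometries give zero loss; orthogonal vs. collinear batches do not.
aligned = student_teacher_loss(np.eye(2), np.eye(2))
mismatched = student_teacher_loss(np.eye(2), np.array([[1.0, 0.0], [1.0, 0.0]]))
```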
Additionally, we experiment with a joint loss that is the sum of the triplet margin loss L_Triplet (see §3.1) and the student-teacher loss L_ST: L_Joint = L_Triplet + L_ST. Training with the joint loss L_Joint achieves an average score of 80.5. Even though the joint loss is not subject to overfitting, its SCIDOCS performance is slightly worse than the triplet loss L_Triplet alone. Given this outcome and that the computation of the cosine similarities adds complexity, we discard the student-teacher approach for the final experiments.

F.10 SPECTER & Bidirectional Citations
SPECTER (Cohan et al., 2020) relies on unidirectional citations for its sampling strategy. While papers cited by the query paper are considered positive samples, those citing the query paper (opposite citation direction) could be negative samples. We see this use of citations as a conceptual flaw in their sampling strategy.
To test the actual effect on the resulting document representations, we first replicate the original unidirectional sampling strategy from SPECTER with our training data (see w/ leakage in §4.2). The resulting SPECTER model achieves an average score of 79.0 on SCIDOCS. When changing the sampling strategy from unidirectional to bidirectional ('citations to the query' are also treated as a signal for similarity), we observe an improvement of +0.4 points to 79.4. Consequently, the use of unidirectional citations is not only a conceptual issue but also degrades learning performance.

G Task-specific Results
Fig. 6 and 7 present the validation performance as in §6 but on a task level rather than as an average over all tasks. The plots show that the optimal k+ and k−hard values are partially task-dependent.

Figure 1:
Figure 1: Starting from a query paper d_Q in a citation graph embedding space, hard positives are citation graph embeddings that are sampled from a similar (close) context of d_Q, but are not so close that their gradients collapse easily. Hard (to classify) negatives (red band) are close to positives (green band) up to a sampling-induced margin. Easy negatives are very dissimilar (distant) from the query paper d_Q.

(I) K-nearest neighbors (KNN): Assuming a given citation embedding model f_c and a search index (e.g., FAISS, §4.3), we run KNN(f_c(d_Q), C) and take c samples from the range (k − c; k] of nearest neighbors around the query paper d_Q with its neighbors N = {n_1, n_2, n_3, ...}, whereby neighbor n_i is the i-th nearest neighbor in the citation embedding space. For instance, for c=3 and k=10 the corresponding samples would be the three neighbors up to and including the tenth: n_8, n_9, and n_10. To reduce computing effort, we sample the neighbors N only once via [0; max(k+, k−hard)], and then generate triplets by range selection in N, i.e., positives = (k+ − c+; k+] and hard negatives = (k−hard − c−hard; k−hard].
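The range selection can be sketched with a brute-force nearest-neighbor search standing in for FAISS (the helper `knn_range_samples` is our own illustration, not the paper's code):

```python
import numpy as np

def knn_range_samples(query_vec, corpus, k, c):
    """Take c samples from the (k - c; k] nearest neighbors of the query,
    i.e., neighbors n_{k-c+1} .. n_k (brute force instead of FAISS)."""
    dists = np.linalg.norm(corpus - query_vec, axis=1)
    order = np.argsort(dists)          # order[0] is the 1st neighbor n_1
    return [int(i) for i in order[k - c:k]]

corpus = np.arange(12, dtype=float).reshape(12, 1)  # toy 1-d embeddings
samples = knn_range_samples(np.array([0.0]), corpus, k=10, c=3)
```

With c=3 and k=10 on this toy corpus, the returned indices correspond to the 8th, 9th, and 10th nearest neighbors, matching the n_8, n_9, n_10 example above.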

Figure 2:
Figure 2: Results on the validation set w.r.t. positive sampling with KNN when using 10% training data.

Figure 4:
Figure 4: Number of collisions w.r.t. the size of the sampling-induced margin as defined through k+ and k−hard.

Figure 5:
Figure 5: Results on the validation set w.r.t. hard negative sampling with SIM using 10% training data.
Figure 3:
Figure 3: Results on the validation set w.r.t. hard negative sampling with KNN, where the five positives are the (k+ − 5; k+] nearest neighbors.

Table 1:
Results on the SCIDOCS test set. With replicated SPECTER training data, SciNCL surpasses the previous best avg. score by 1.8 points and also outperforms the baselines in 9 of 12 task metrics. Our scores are reported as mean and standard deviation σ over ten random seeds. With training data randomly sampled from S2ORC, SciNCL outperforms SPECTER in terms of avg. score by 1.7 points. LLMs without citations or contrastive objectives yield generally poor results. The scores with * are from Cohan et al. (2020). Oracle SciDocs† is the upper bound of the performance with triplets from SCIDOCS's data.

Table 2:
Ablations. Numbers are averages over tasks of the SCIDOCS test set, the average score over all metrics, and the rounded absolute difference to SciNCL.

Table 3:
Mapping to the S2ORC citation graph.

Table 4:
Statistics for our two datasets and their overlap with SPECTER and SciDocs, respectively.

Table 5:
Link prediction performance of BigGraph embeddings trained on the S2ORC citation graph with different dimensions and distance measures.