The Inefficiency of Language Models in Scholarly Retrieval: An Experimental Walk-through

Language models are increasingly popular in AI-powered scientific IR systems. This paper evaluates popular scientific language models on (i) short query texts and (ii) textual neighbors. Our experiments show that these models fail to retrieve the relevant document for a short query even under the most relaxed conditions. Additionally, we leverage textual neighbors, generated by small perturbations of the original text, to demonstrate that not all perturbations lead to close neighbors in the embedding space. Further, an exhaustive categorization yields several classes of orthographically and semantically related, partially related, and completely unrelated neighbors. Retrieval performance turns out to be influenced more by the surface form than by the semantics of the text.


Introduction
Representation learning methods have substantially reshaped strategies for exploring large volumes of scientific literature. Popular applications include summarization, construction of mentor-mentee networks (Ke et al., 2021), recommendation (Ostendorff et al., 2020; Cohan et al., 2020; Das et al., 2020; Hope et al., 2021), QA over scientific documents (Su et al., 2020), and verification of scientific claims (Wadden et al., 2020). The growing community interest has led to the development of several scientific document embedding models over the past five years, such as OAG-BERT (Liu et al., 2021), SPECTER (Cohan et al., 2020), SciBERT (Beltagy et al., 2019), and BioBERT (Lee et al., 2020). OAG-BERT has been deployed in the AMiner production system. Given similar possibilities of future deployments of scientific document embedding models in existing scholarly systems, it is crucial to robustly evaluate these models and identify their limitations. However, to the best of our knowledge, no existing work critically analyzes scientific language models.
To motivate the reader, we present a simple experiment. The queries 'document vector' and 'document vectors' fetch no common candidates among the first-page results of Google Scholar and Semantic Scholar (candidates in Appendix A). This illustrates the extremely brittle nature of such systems: minor alterations in the query text lead to completely different search outcomes. To motivate further, we experiment with textual queries encoded by the popular SciBERT model. The perturbed text 'documen vector' (relevant to AI) is closer in the embedding space to the biomedical term 'virus vector'. We found similar observations for many other queries. As scholarly search and recommendation systems are complex and their detailed algorithms are not publicly available, we analyze the behavior of scientific language models, which are (or will be) presumably an integral component of each of these systems. Motivated by the usage of perturbed inputs to stress test ML systems in interpretability analysis, we propose to use 'textual neighbors' to analyze how such inputs are represented in the embedding space of scientific LMs. Unlike previous works (Ribeiro et al., 2020; Rychalska et al., 2019), which analyze the effect of perturbations on downstream task-specific models, we focus on analyzing the embeddings that are originally inputs to such downstream models. With the explosion of perturbation techniques for various kinds of robustness and interpretability analysis, it is difficult to generalize the insights gathered from perturbation experiments. We therefore propose a classification schema, based on orthography and semantics, to organize the perturbation strategies.
The distribution of various types of textual neighbors in a training corpus is non-uniform. Specifically, the low frequency of some textual neighbors results in non-optimized representations, wherein semantically similar neighbors might end up distant in the space. These non-optimal representations have a cascading effect on downstream task-specific models. In this work, we analyze whether all textual neighbors of an input X are also X's neighbors in the embedding space. Further, we study whether their presence in the embedding space can negatively impact downstream applications that depend on similarity-based document retrieval. Our main contributions are:
1. We introduce (in Section 3) five textual neighbor categories based on orthography and semantics. We further construct a non-exhaustive list of thirty-two textual neighbor types and organize them into these five categories to analyze the behavior of scientific LMs on manipulated text.
2. We conduct (in Section 5) robust experiments to showcase the limitations of scientific LMs under a short-text query retrieval scheme.
3. We analyze (in Section 6) embeddings of textual neighbors and their placement in the embedding space of three scientific LMs, namely SciBERT, SPECTER, and OAG-BERT. Our experiments highlight the capabilities and limitations of different models in representing different categories of textual neighbors.

Related Works
Several works utilize textual neighbors to interpret decisions of classifiers (Ribeiro et al., 2016; Gardner et al., 2020), test linguistic capabilities of NLP models (Ribeiro et al., 2020), measure progress in language generation (Gehrmann et al., 2021), and generate semantically equivalent adversarial examples (Ribeiro et al., 2018). Similar to these works, we use textual neighbors of scientific papers to analyze the behavior of scientific LMs. MacAvaney et al. (2020) analyze the behavior of neural IR models by proposing test strategies: constructing test samples by controlling specific measures (e.g., term frequency, document length) and by manipulating text (e.g., removing stops, shuffling words). This is closest to our work, as we also employ text manipulation to analyze the behavior of scientific language models, using a simple Alternative-Self Retrieval scheme (Section 4). However, our focus is not the evaluation of retrieval-augmented models; we only use a relaxed document retrieval pipeline to analyze how scientific LMs trained on diverse domains encode scientific documents. We organize the textual neighbors into categories that capture different capabilities of LMs. We also show that it is crucial to evaluate models on dissimilar texts rather than only on semantically similar textual neighbors. Ours is the first work to analyze the properties of scientific LMs for different inputs, and it can be utilized to design and evaluate future scientific LMs. Due to space limitations, we present a detailed discussion of scientific LMs in Appendix C.

Short Queries and Textual Neighbors
In this paper, we experiment with short queries to fetch relevant scientific documents. The term 'short' signifies a query length comparable to the length of research titles. The candidates are constructed from either the title (T) or the title and abstract (T+A) text. We further make small alterations to the candidate text to construct 'textual neighbors'.
The textual neighbors can be syntactically, semantically, or structurally similar to the candidate text. Unlike previous works that explore textual neighbors to analyze and stress test complex models (Q&A, sentiment, NLI, NER (Rychalska et al., 2019)), we experiment directly with representation learning models and analyze the placement of textual neighbors in their embedding space. While semantically similar neighbors are frequently used in previous works (Ribeiro et al., 2018), we also explore semantically dissimilar textual neighbors to analyze scientific language models. While an LM is expected to represent semantically similar texts with high similarity, some orthographically similar but semantically dissimilar texts can have highly similar embeddings, which is undesirable behavior. Note that we restrict the current query set to titles for two main reasons: (i) most real-world search queries are short, and (ii) flat keyword-based search lacks intent and can lead to erroneous conclusions. Textual neighbors have a similar word form or similar meaning. We observe that textual neighbors possess properties based on two aspects: (i) Orthography (surface-level information content of text in terms of characters, words, and sentences) and (ii) Semantics. These two aspects are integral for a pair of texts to be textual neighbors. The properties of these aspects are:
1. Orthography: Orthographic neighbors are generated by making small surface-level transformations to the input. Based on the textual content, neighbors can be generated by (a) Lossy Perturbation (LO), where 'lossy' indicates that some information content of the original text is lost, or (b) Lossless Perturbation (LL), where the information content is preserved.
2. Semantics: Based on meaning, a neighbor can be Highly Similar (HS), Partially Similar (PS), or Dissimilar (DS) to the original text.
Since SPECTER and OAG-BERT utilize paper titles and abstracts to learn embeddings, we generate textual neighbors by altering the texts of these two input fields. We present 32 textual neighbor types in Table 1. Each of these 32 neighbor types is categorized into one of five categories: LO-HS, LO-PS, LO-DS, LL-HS, and LL-PS. We exclude the LL-DS category (e.g., scrambling all words in the text) as it is infrequent and less probable in a real-world setting. Examples of the textual neighbors are presented in Appendix B (Table 7).
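Several of these neighbor types can be generated with a few lines of string manipulation. The sketch below is purely illustrative (the function names and exact sampling are our assumptions, not the paper's released code): a lossless sentence rotation (LL-PS), a lossy random word deletion (LO-PS), and a lossless whitespace perturbation in the spirit of T_A_WS (LL-HS).

```python
import random

def rotate_sentences(abstract):
    # LL-PS 'T_ARot'-style neighbor: move the first sentence to the end (lossless).
    sents = abstract.split(". ")
    return ". ".join(sents[1:] + sents[:1])

def delete_random_words(text, frac=0.3, seed=0):
    # LO-PS 'T_ADelRand'-style neighbor: drop roughly `frac` of the words (lossy).
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() > frac)

def pad_whitespace(text, frac=0.5, seed=0):
    # LL-HS 'T_A_WS'-style neighbor: widen some single spaces to 2-5 spaces (lossless).
    rng = random.Random(seed)
    return "".join(" " * rng.randint(2, 5) if ch == " " and rng.random() < frac else ch
                   for ch in text)
```

Note that `pad_whitespace` changes only the surface form: stripping whitespace recovers the original text exactly, which is why such a neighbor should ideally be embedded very close to its source.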

Experiment Design
The Alternative-Self Retrieval: We propose an embarrassingly simple binary retrieval scheme in which the candidate set contains only one relevant document. Alternative-Self Retrieval refers to the characteristic that the query is an altered version of the relevant candidate document. For example, the candidate documents are the embeddings of paper titles and abstracts (henceforth T+A), and the query is the embedding of a title. We present a schematic of three Alternative-Self Retrieval schemes in Figure 1. The retrieval is simple, and we measure performance with the accommodating metrics discussed further in this section. Our purpose is to analyze scientific LM embeddings under the most relaxed conditions.
The Datasets: We evaluate the scientific LMs on seven datasets (statistics in Table 2) to understand their effectiveness in encoding documents from diverse research fields. Each dataset contains the titles and abstracts of papers. We curate the ACL Anthology dataset and the ICLR dataset from OpenReview. To control the size of the ACL Anthology dataset, we exclude papers from workshops and non-ACL venues. We also curate five datasets from arXiv for the domains Mathematics (MT), High Energy Physics (HEP), Quantitative Biology (QB), Economics (ECO), and Computer Science (CS).
We make our code and dataset available for public access. The Notations: D is the set of seven datasets described in Table 2. X is the set of original input texts to the scientific LMs, consisting of the paper title (T) and abstract (A). For d ∈ D, X = {x_j : x_j = (T+A)(p), where (T+A)(p) = concat(title(p), abstract(p)), ∀ paper p ∈ d}. f represents the type of textual neighbor (represented by the neighbor codes in Table 1). Q and R are the query and candidate sets for the IR task.
Evaluation Metrics: We report performance scores on the following retrieval metrics. Mean Reciprocal Rank (MRR): all our tasks use binary relevance of documents to compute MRR. T100: the percentage of queries that retrieve the one and only relevant document among the top-100 documents.
NNk_Ret: the percentage of queries in a textual neighbor category whose k nearest neighbors (k-NN) retrieve the original document. AOP-10: the average overlap percentage among the 10-NN of x and y, where x = (T+A)(x_j) and y = f(x_j). f is a text manipulation function, represented by the textual neighbor codes in Table 1.
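As a concrete reference for how the Alternative-Self scheme combines with the first two metrics, here is a minimal sketch. Random vectors stand in for real model embeddings, and the helper names are ours, not from the paper's code:

```python
import numpy as np

def cosine_rank(query_vec, candidate_vecs):
    """Return the 1-indexed rank of every candidate by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    C = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    order = np.argsort(-(C @ q))            # candidate indices, best first
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks                            # ranks[i] = rank of candidate i

def mrr(relevant_ranks):
    """Mean Reciprocal Rank over the 1-indexed ranks of the single relevant document."""
    return sum(1.0 / r for r in relevant_ranks) / len(relevant_ranks)

def t100(relevant_ranks):
    """Percentage of queries whose relevant document lands in the top 100."""
    return 100.0 * sum(r <= 100 for r in relevant_ranks) / len(relevant_ranks)

# Alternative-Self Retrieval: query i's only relevant candidate is document i itself.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(200, 32))                        # stand-ins for T+A embeddings
queries = doc_embs + 0.05 * rng.normal(size=doc_embs.shape)  # altered versions (e.g. T)
rel_ranks = [cosine_rank(q, doc_embs)[i] for i, q in enumerate(queries)]
```

With such small perturbations nearly every query ranks its own document first, which is exactly the "most relaxed conditions" baseline the real models are measured against.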

Analysing Embeddings for Scientific Document Titles and Abstracts
In this section, we experiment with the inputs to the scientific language models. Because paper titles and abstracts are freely available and easy to parse, the majority of scientific LMs learn embeddings from the title and abstract of a paper. However, multiple downstream applications, such as document search, involve short queries (often keywords). We present two Alternative-Self Retrieval experiments to compare the embeddings of the paper title (T) with the embeddings of the paper title and abstract (T+A). In both experiments, |Q| = 1000 queries for each dataset.
Task I: Querying titles against original candidate documents. In this Alternative-Self Retrieval setup, given a query q constructed only from the paper title, the system recommends relevant candidate document embeddings constructed from both title and abstract (T+A). This setting is similar to querying a scientific literature search engine, as search queries are usually short.

Table 3: For both Task I and Task II, SPECTER consistently performs the best on all datasets. For Task II, the drop in MRR and T100 scores for SPECTER is significant in comparison to Task I.

The motivation behind this experiment is to analyse the similarity between the T and the T+A embedding of a paper (Figure 1(a)). The experiment details are: Candidate Documents: R = {x_k : x_k ∈ X}. Task: for q_j ∈ Q, rank the candidates based on cosine similarity. Evaluation: MRR and T100. For each q_j, there is only one relevant document in the candidate set, which is the corresponding T+A embedding. We present the results for various models on different domains in Table 3. The results suggest that SciBERT performs poorly for all domains. OAG-BERT, on average, ranks the original document in the 5-7th position. However, we also observe that even in the best case, only 33% of queries retrieve the original document among the top-100 retrieved candidates. SPECTER, on the other hand, performs consistently better than both SciBERT and OAG-BERT. The MRR score suggests that, on average, the original document is ranked among the top-2 documents, and SPECTER also has a good T100 score across all domains. However, for around 10% of the queries in the Arxiv-MATH, Arxiv-HEP, and ICLR datasets, SPECTER does not rank the original relevant document among the top-100 retrieved candidates.
Task II: Introducing all titles into the candidate set. To increase the complexity of the previous task, we add all the title embeddings (T) to the candidate set (Figure 1(b)). We test whether the T embeddings are more similar to other titles or to their corresponding T+A embeddings.
Query: Q = {q_j : x_j ∈ X, f = T, q_j = f(x_j)}. Candidate Documents: for query q_j, the candidate set R_j contains all T+A embeddings together with all title embeddings except q_j itself, i.e., R_j = {(T+A)(x_i) : x_i ∈ X} ∪ {T(x_i) : x_i ∈ X, x_i ≠ x_j}. The task and evaluation metrics are the same as in Task I. Extremely poor values for SciBERT (Table 3) lead us to examine the vector space of embeddings presented in Figure 2 (t-SNE plots for T and T+A embeddings), revealing that T and T+A embeddings form two non-overlapping clusters. Even though the title text is a subset of T+A, the SciBERT embeddings are significantly different, suggesting that input length influences SciBERT. This highlights the issue in retrieval with queries and candidates of varying lengths. We present the t-SNE plots for the other datasets in Appendix D.
SPECTER still performs the best, but a significant drop in MRR and T100 suggests that both T and T+A embeddings tightly cluster together in partially overlapping small groups (as can be verified from Figure 2). However, a T100 for OAG-BERT comparable to the previous experiment suggests that the model does not falter when the input text is short. Ideally, we expect T and T+A embeddings to overlap, indicating that the embeddings of the same paper are closer. The pretraining of these models could be the reason for such a distribution of T and T+A embeddings. SciBERT is trained on sentences from the full text of research papers, leading to different representations for short (T) and longer (T+A) texts. As SPECTER and OAG-BERT are trained on both title and abstract fields, such non-overlapping behavior is not observed for them.
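The clustering behavior discussed above can be visualized along the lines of Figure 2. In the sketch below, random Gaussians stand in for SciBERT's T and T+A embeddings (the mean offset mimics the two non-overlapping clusters), and we assume scikit-learn is available:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
t_embs = rng.normal(loc=0.0, size=(30, 16))   # stand-ins for title (T) embeddings
ta_embs = rng.normal(loc=3.0, size=(30, 16))  # stand-ins for title+abstract (T+A) embeddings

# Project both sets jointly into 2-D; the first 30 rows are T, the rest T+A.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(
    np.vstack([t_embs, ta_embs]))
```

Scattering the first 30 rows against the rest in one color per set reproduces the qualitative picture: two separated clouds indicate that the model encodes the same paper's short and long inputs very differently.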

Analysing Scientific LMs with Textual Neighbors
In the previous section, we experimented with different input fields (T vs. T+A). In this section, we experiment with the 32 textual neighbor classes (which alter different input fields: T, A, or T+A). We present our results for the following experiments for the five broad categories: LL-HS, LL-PS, LO-HS, LO-PS, and LO-DS. Due to space constraints, we present plots for selected datasets for each experiment in the paper. The rest of the plots are presented in Appendix E.

Distribution of Textual Neighbors in the Embedding Space
We measure how textual neighbor embeddings are distributed in the embedding space of each dataset when encoded by the SciBERT, SPECTER, and OAG-BERT models. For each textual neighbor class listed in Table 1, we compute pairwise similarities among all input pairs. A plot of the similarity values for different textual neighbor categories is presented in Figure 4 (additional plots in Appendix E.2). We observe that the pairwise similarities among documents are spread over a significantly broader range for OAG-BERT than for SciBERT and SPECTER on all datasets. The average similarity is above 0.5 for all datasets and all models. We do not observe any significant difference in average similarity across textual neighbor classes. Interestingly, for the SPECTER model, the minimum similarity is greater than zero for all datasets across all neighbor categories. Document pair similarity via OAG-BERT embeddings has a low average for the LO-DS category.
We present the percentage of document pairs for each textual neighbor class whose similarity is greater than the average similarity in Figure 3 (additional plots in Appendix E.2). OAG-BERT shows high inter-similarity (greater than 50%) for the majority of textual neighbors, i.e., more than 50% of document pairs have a cosine similarity greater than the average similarity. For SPECTER vectors, all types of textual neighbors have around 50% of document pairs with a similarity greater than the mean similarity. However, the extremely high percentages for the SciBERT model on the ACL dataset, and for OAG-BERT on almost all datasets, suggest that the majority of documents are represented compactly in the embedding space for all textual neighbor categories.
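The statistics reported in Figures 3 and 4 reduce to a small computation over all unordered pairs. A sketch over an arbitrary embedding matrix (the function name is ours):

```python
import numpy as np

def pairwise_similarity_stats(embs):
    """Min, mean, max pairwise cosine similarity, and % of pairs above the mean."""
    E = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    S = E @ E.T
    sims = S[np.triu_indices(len(E), k=1)]   # all unordered document pairs
    pct_above_mean = 100.0 * np.mean(sims > sims.mean())
    return float(sims.min()), float(sims.mean()), float(sims.max()), float(pct_above_mean)
```

Applied per textual neighbor class, the min/mean/max triple corresponds to the arrowheads and bold lines of Figure 4, and the last value to the percentages of Figure 3.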

Similarity of Textual Neighbors with Original Documents
Let F = {f_1, f_2, ..., f_n} be the set of textual neighbor functions described in Table 1. We query different types of textual neighbor classes against the original document embeddings (T+A). We compute the percentage of queries that successfully rank the original document in the top-1 and top-10 ranked lists. We expect the HS and PS categories to rank the original document higher in the ranked list, and DS to rank it lower. If any textual neighbor class or category does not show the expected behavior, it can be asserted that the LM is brittle in representing that specific type of textual neighbor.
Candidate Documents: R = {x_k : x_k ∈ X}. Task: for q_j ∈ Q, retrieve the most similar documents based on cosine similarity. Evaluation: NN1_Ret and NN10_Ret. There is only one relevant document in the candidate set for each q_j, which is the corresponding T+A embedding. We present the results in Figure 5. SciBERT and OAG-BERT show less than 50% NN10_Ret for the LO-DS category, which is desirable, as LO-DS neighbors are semantically dissimilar and hence should not be neighbors in the embedding space. SciBERT shows improvement in NN10_Ret over NN1_Ret for the PS categories. High NN1_Ret for the HS categories indicates that SciBERT successfully encodes highly similar texts closer than partially similar texts. OAG-BERT performs poorly on both metrics, indicating that it does not encode textual neighbors optimally. SPECTER embeddings perform poorly on the LO-DS category; however, they achieve the maximum values elsewhere, showing no difference between the LO and LL, or the HS and PS, categories.
To analyse the high values for SPECTER, we present the individual NN1_Ret for each of the 32 textual neighbor classes in Figure 6 and observe that only two classes, 'TDelNN' and 'T_A_DelNNChar', lead to less than 90% NN1_Ret. Unlike SPECTER and OAG-BERT, SciBERT preserves the hierarchy, with Highly Similar classes ranked higher than Partially Similar classes. An interesting case with SciBERT embeddings is that the T_A_WS neighbor class, belonging to the LL-HS category, has a low NN1_Ret value across all datasets, suggesting that the SciBERT model is extremely brittle to whitespace perturbations (because of the constraint on sequence length). Another breaking point for SciBERT is the textual neighbor class T_ARepADJ (replacing adjectives with antonyms) of the LO-DS category, which shows high NN1_Ret values (around 80%), which is undesirable. Among the three LO-PS classes 'T_ADelQ1', 'T_ADelQ2', and 'T_ADelQ3', SciBERT performs worst on 'T_ADelQ3', indicating that the last quantile of the abstract contains relevant information encoded by SciBERT. OAG-BERT shows the reverse trend to SPECTER, achieving low values for all neighbor classes, indicating brittleness to text manipulation.
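The NNk_Ret metric used throughout this section reduces to a small nearest-neighbor check. A sketch with synthetic embeddings (the function name is ours):

```python
import numpy as np

def nnk_ret(neighbor_embs, orig_embs, k):
    """NNk_Ret: % of textual-neighbor queries whose k nearest candidates
    (by cosine similarity) include the original document at the same index."""
    Q = neighbor_embs / np.linalg.norm(neighbor_embs, axis=1, keepdims=True)
    C = orig_embs / np.linalg.norm(orig_embs, axis=1, keepdims=True)
    topk = np.argsort(-(Q @ C.T), axis=1)[:, :k]
    return 100.0 * float(np.mean([i in topk[i] for i in range(len(Q))]))
```

By construction NN10_Ret can only be greater than or equal to NN1_Ret for the same queries, which is why the stacked bars in Figure 5 show NN10_Ret on top of NN1_Ret.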

Overlap amongst Nearest Neighbors
We compute the overlap among the nearest neighbors of each textual neighbor class and the original document embeddings. We randomly sample a query set. Candidate Documents: R = {q_j : x_j ∈ X, q_j = f(x_j)} ∪ {q_j : x_j ∈ X, f = T+A, q_j = f(x_j)}. Task: for each pair q_j, q_k ∈ Q such that q_j ∈ Q_f and q_k ∈ Q_{T+A}, compute the overlap among the ten nearest neighbors (10-NN) of q_j and q_k. Evaluation: AOP-10. We present the results arranged by textual neighbor categories in Figure 7.
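The AOP-10 computation itself is a set overlap between two 10-NN lists drawn from the same corpus. A minimal sketch with synthetic data (the helper name is ours):

```python
import numpy as np

def aop_k(neighbor_embs, orig_embs, corpus_embs, k=10):
    """AOP-k: average % overlap between the k-NN lists (in a shared corpus) of
    the textual-neighbor embeddings f(x) and the original T+A embeddings."""
    Cn = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)

    def topk(Q):
        Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
        return np.argsort(-(Qn @ Cn.T), axis=1)[:, :k]

    a, b = topk(neighbor_embs), topk(orig_embs)
    return 100.0 * float(np.mean([len(set(a[i]) & set(b[i])) / k for i in range(len(a))]))
```

If a perturbation leaves the embedding essentially unchanged, the two k-NN lists coincide and AOP-k is 100; a well-behaved model should score high for HS neighbors and low for DS neighbors.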

Conclusion
We propose five categories of textual neighbors to organize the increasing number of textual neighbor types: LL-HS, LO-HS, LL-PS, LO-PS, and LO-DS. We evaluate the SciBERT, SPECTER, and OAG-BERT models on thirty-two textual neighbor classes organized into these five categories.
We show that evaluating language models on 'semantically dissimilar' texts is also important, rather than evaluating only on 'semantically similar' texts.
We show that the SciBERT model is highly sensitive to the input length. SPECTER embeddings for all types of textual neighbors are highly similar, irrespective of whether the textual neighbor is semantically dissimilar or not. SPECTER embeddings show sensitivity to the presence of specific keywords. Lastly, OAG-BERT embeddings of all categories of textual neighbors are highly dissimilar to the original title and abstract (T+A) embeddings. We believe that our insights could be used to develop better pretraining strategies for scientific document language models and to evaluate other language models. One example for MLM (or replaced token identification) could be a weighted-penalty loss, i.e., predicting a partially similar token should be penalized less than predicting an unrelated (or dissimilar) token. Additionally, these insights could also be utilised by systems that use these scientific document language models to incorporate informed strategies in downstream systems such as recommendation systems.
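The weighted-penalty suggestion above could, for instance, take the form of an expected-cost loss in which probability mass placed on tokens similar to the gold token is penalized less. This is purely our illustration of the idea, not an implementation from the paper; the cost here is derived from cosine similarity between (toy) token embeddings:

```python
import numpy as np

def weighted_penalty_loss(logits, target_id, token_embs):
    """Expected penalty under the model's output distribution, where
    cost(v) = 1 - cos(v, target): mass on tokens similar to the target
    (partially similar tokens) is penalized less than mass on dissimilar ones."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    E = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    cost = 1.0 - E @ E[target_id]           # zero cost for the target token itself
    return float(probs @ cost)
```

Unlike plain cross-entropy, this loss approaches zero whenever the predicted distribution concentrates on tokens whose embeddings are close to the target, implementing the "partially similar tokens penalized less" intuition.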

B Textual Neighbors
We present examples of textual neighbors in Table 7. We use NLTK for preprocessing text and constructing textual neighbors.

C Summary of Scientific LMs
We discuss some popular scientific document language models which leverage the transformer architecture.
SciBERT (Beltagy et al., 2019) is a BERT model trained on large amounts of scientific data. It is trained on a random sample of 1.14M papers from the Semantic Scholar corpus. The training corpus consists of 18% papers from the computer science domain and 82% from the broad biomedical domain. Full texts of the papers are used for training.
SPECTER (Cohan et al., 2020) uses citation-informed Transformers to generate general-purpose vector representations of scientific documents. Unlike traditional models, it also leverages inter-document relatedness to learn general-purpose embeddings that are effective across a variety of downstream tasks without task-specific fine-tuning. SPECTER uses citations as a signal for document relatedness and formulates this as a triplet-loss pretraining objective. SPECTER achieves state-of-the-art results on six out of seven document-level tasks for scientific literature in the SCIDOCS (Cohan et al., 2020) benchmark suite.
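SPECTER's triplet objective can be sketched as a standard margin loss over L2 distances. This is our simplified numpy rendering of the objective described in Cohan et al. (2020), not their training code:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin loss pushing a cited (positive) paper's embedding closer to the
    query paper than an uncited (negative) paper's embedding, by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to a cited paper
    d_neg = np.linalg.norm(anchor - negative)   # distance to an uncited paper
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the positive is closer than the negative by the margin; during pretraining, gradients of this quantity with respect to the Transformer's document embeddings shape the citation-informed space.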
OAG-BERT (Liu et al., 2021) jointly models texts (the title and abstract of the paper) and heterogeneous academic entities (authors, research fields, venues, and affiliations) to learn representations for a scientific document. The architecture is similar to BERT; however, the authors employ multiple techniques to learn entity embeddings. To distinguish different textual and academic entities, they use entity type embeddings to indicate the entity type. They design an entity-aware 2D positional encoding to indicate the inter-entity and intra-entity sequence order. They also propose span-aware entity masking to preserve the sequential relationship between an entity's tokens.
BioBERT (Lee et al., 2020) is a BERT model pretrained on large-scale biomedical corpora. The BioBERT model is initialized with BERT weights and then pretrained on PubMed abstracts and PMC full-text articles. Succeeding BioBERT, several models have been trained exclusively for biomedical texts, such as ClinicalBERT (Huang et al., 2019), MIMIC-BERT (Si et al., 2019), PubMedBERT (Gu et al., 2020), and BioMegatron (Shin et al., 2020), to list a few. However, in this work, we focus on general-purpose scientific language models that have been trained on scientific documents from diverse research fields.
We summarize the details of the SciBERT, SPECTER, and OAG-BERT models in Table 8.

C.1 Non Transformer-based Models
The majority of non-Transformer-based models utilise the Paragraph Vector (Le and Mikolov, 2014) technique to learn vectors for textual content. Citation networks are utilised to learn similar embeddings for related papers. Paper2Vec (Ganguly and Pudi, 2017) learns embeddings by applying DeepWalk (Perozzi et al., 2014) on an augmented citation network of papers. Apart from connecting cited papers, the augmented network also connects the k nearest neighbors (from textual embeddings generated using Paragraph Vector (Le and Mikolov, 2014)). Paper2Vec (Tian and Zhuo, 2017) learns distributed vertex embeddings from matrix factorization on the weighted context definition of nodes. Following a similar technique to Paper2Vec (Ganguly and Pudi, 2017), VOPRec (Vector Representation Learning of Papers with Text Information and Structural Identity for Recommendation) (Kong et al., 2021) learns embeddings from the text using Paragraph Vector (Le and Mikolov, 2014) and from the citation network using Struc2Vec (Ribeiro et al., 2017). Zhu et al. (2019) present a method to learn scholarly paper embeddings (Represent Anything from Scholar Papers) from different scholarly entities such as title, authors, publication venue, and citations. The model is trained by maximizing the likelihood of the references of a paper. It uses an encoder-decoder framework to learn representations from title words, author names, publication venue, and publication year. The proposed method can generate representations for papers even if the references are missing, as that information is already encoded in the entities during training.
As neither the pretrained models nor the code for any of the non-Transformer-based models is publicly available, we skip their evaluation in this work.

Category | Code | Form | Example

LL-PS | T_ARot | T → Preserve, A → Rotate | A: We introduce a representation for computer programs based on language models. We train deep robust embeddings using pytorch. Contextual embeddings are common in NLP.

LL-PS | T_AShuff | T → Preserve, A → Shuffle | A: Contextual embeddings are common in NLP. We introduce a representation for computer programs based on language models. We train deep robust embeddings using pytorch.

LL-PS | T_ASortAsc | T → Preserve, A → Sort Ascending | A: Contextual embeddings are common in NLP. We train deep robust embeddings using pytorch. We introduce a representation for computer programs based on language models.

LL-PS | T_ASortDesc | T → Preserve, A → Sort Descending | A: We introduce a representation for computer programs based on language models. We train deep robust embeddings using pytorch. Contextual embeddings are common in NLP.

LO-PS | T_ADelRand | T → Preserve, A → Random word deletion 30% | A: Contextual are common in . introduce a representation computer programs based on language models. We train deep robust using .

LO-PS | T_ADelADJ | T → Preserve, A → Delete all ADJs | A: We introduce a representation for computer programs based on language models. We train embeddings using pytorch.

LO-DS | T_ADelNN | T → Preserve, A → Delete all NNs | A: Contextual are common in . We introduce a for based on . We train deep robust using .

LO-PS | T_ADelVB | T → Preserve, A → Delete all Verbs | A: Contextual embeddings common in NLP. We a representation for computer programs on language models. We deep robust embeddings pytorch.

LO-PS | T_ADelADV | T → Preserve, A → Delete all ADVs | A: Contextual embeddings are common in NLP. We introduce a representation for computer programs based on language models. We train deep robust embeddings using pytorch.

LO-PS | T_ADelPR | T → Preserve, A → Delete all PRs | A: Contextual embeddings are common in NLP. introduce a representation for computer programs based on language models. train deep robust embeddings using pytorch.

LO-HS | T_ADelDT | T → Preserve, A → Delete all DTs | A: Contextual embeddings are common in NLP. We introduce representation for computer programs based on language models. We train deep robust embeddings using pytorch.

LO-PS | T_ADelNum | T → Preserve, A → Delete all Numbers | A: Contextual embeddings are common in NLP. We introduce a representation for computer programs based on language models. We train deep robust embeddings using pytorch.

| T_ADelNNPH | T → Preserve, A → Delete all NN Phrases | A: Contextual embeddings are common in NLP. We introduce a representation for based on . We train deep using pytorch.

LO-PS | T_ADelQ2 | T → Preserve, A → Delete quantile 2 | A: Contextual embeddings are common in NLP. We train deep robust embeddings using pytorch.

LO-PS | T_ADelQ3 | T → Preserve, A → Delete quantile 3 | A: Contextual embeddings are common in NLP. We introduce a representation for computer programs based on language models.

We present the results of Task I (Section 5) with normalized embeddings in Table 9.

E Analysing Scientific LMs with Textual Neighbors
We present the plots for each of the seven datasets for the experiments with textual neighbors in the following sections.

E.1 Distribution of Textual Neighbors in the Embedding Space
We present the plot of the inter-similarity of textual neighbor vectors in Figure 9, which depicts the maximum, minimum, mean, and standard deviation over all pairs of documents for the five neighbor categories. Next, we present the percentage of document pairs for each of the 32 textual neighbor classes whose similarity is greater than the average similarity for that particular class in Figure 10.

E.2 Similarity of Textual Neighbors with Original Documents

NN10_Ret does not improve retrieval significantly for the HS categories (LO-HS and LL-HS); however, it does improve retrieval recall for the PS categories (LO-PS and LL-PS). OAG-BERT performs poorly, with all categories achieving NN10_Ret values of less than 40%. We present in Figure 12 the NN1_Ret for each of the 32 textual neighbor classes. A close inspection reveals an interesting case for SciBERT, which has extremely low NN1_Ret values (< 10%) for one of the LL-HS classes, T_A_WS, which randomly replaces 50% of whitespace characters with 2-5 whitespaces.

E.3 Overlap amongst Nearest Neighbors
AOP-10 is the average overlap percentage among the 10-NN (10 nearest neighbors) of the original document embeddings (T+A) and the textual neighbor embeddings. The AOP-10 distribution for the 32 textual neighbor classes is presented in Figure 13. The low overlap percentage for OAG-BERT suggests that the model falters when presented with textual neighbors and does not place them in the neighborhood of the original document embeddings.

Figure 1: Alternative-Self Retrieval schemes for (a) Sec. 5 Task I, (b) Sec. 5 Task II, and (c) Sec. 6.2. Green represents the relevant candidate document for the query. The query is a subset of the relevant candidate document in schemes (a) and (b), and a textual neighbor of the relevant candidate in scheme (c).
Figure 2: t-SNE plots for T and T+A embeddings for the ICLR dataset. Completely non-overlapping embeddings for T and T+A from the SciBERT model highlight differences in encoding texts of varying lengths.

Figure 3: Percentage of pairs of documents for each textual neighbor class whose similarity is greater than the average similarity. OAG-BERT has high inter-similarity (> 50%), i.e., more than 50% of document pairs have cosine similarity greater than the average similarity.

Figure 4: Inter-similarity of textual neighbor vectors. Bold lines represent the µ and σ of pairwise similarities. Arrowheads represent min and max values. Pairwise similarities are spread over a broad range for OAG-BERT, suggesting that its vectors for textual neighbors are more spread out in the vector space.

Figure 5: The bottom and stacked bars represent NN1_Ret and NN10_Ret respectively. Results suggest that SciBERT embeddings for textual neighbors of scientific text perform best.

Figure 6: Distribution of NN1_Ret for each textual neighbor category. SciBERT embeddings preserve the hierarchy of NN1_Ret, i.e., PS categories (LO-PS and LL-PS) have lower values than HS categories (LO-HS and LL-HS).

Figure 7: AOP-10 distribution of all categories of textual neighbors. SciBERT performs poorly for LL-PS (which consists of neighbors that scramble abstract sentences). Ignoring the LO-DS category, SPECTER embeddings perform decently overall.
T → Preserve A → Delete top 50% NPs A: Contextual embeddings are common in NLP. We introduce a representation for based on . We train deep robust embeddings using pytorch.

Figure 11 presents the NN1_Ret and NN10_Ret for each of the datasets for the five textual neighbor categories.
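The NN1_Ret/NN10_Ret metric can be sketched as the fraction of perturbed queries whose original document appears among the top-k corpus neighbors. This is a hypothetical re-implementation under our assumptions: cosine similarity on normalised vectors, and the original document competing within the full corpus of original embeddings.

```python
import numpy as np

def nn_ret(corpus_emb, neighbor_emb, k=1):
    """NNk_Ret: percentage of textual neighbors whose top-k nearest corpus
    documents (by cosine similarity) include the original document.

    corpus_emb[i] is the original (T+A) embedding of document i, and
    neighbor_emb[i] is the embedding of its perturbed textual neighbor.
    """
    C = np.asarray(corpus_emb, dtype=float)
    N = np.asarray(neighbor_emb, dtype=float)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    N = N / np.linalg.norm(N, axis=1, keepdims=True)
    hits = 0
    for i, q in enumerate(N):
        topk = np.argsort(-(C @ q))[:k]  # indices of k most similar docs
        hits += int(i in topk)           # did we retrieve the original?
    return 100.0 * hits / len(N)
```

By construction NN10_Ret ≥ NN1_Ret for the same query set, since every top-1 hit is also a top-10 hit.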

Figure 9: Inter-similarity of textual neighbor vectors. The pairwise similarities among documents are spread over a significantly broader range for OAG-BERT than for SciBERT and SPECTER, suggesting that OAG-BERT vectors for different textual neighbors are more spread out in the vector space.

Figure 10: Percentage of pairs of documents for each textual neighbor class whose similarity is greater than the average similarity. OAG-BERT shows high inter-similarity (greater than 50%) among all textual neighbors, suggesting that more than 50% of document pairs have a cosine similarity greater than the average similarity.

Figure 11: Stacked bars representing the percentage of documents of each textual neighbor category that retrieve the original document within the top-1 and top-10 nearest neighbors respectively. Results suggest that SciBERT embeddings for textual neighbors of scientific text perform best.

Figure 12: Distribution of NN1_Ret for each textual neighbor category. SciBERT embeddings preserve the hierarchy of NN1_Ret, i.e., Partially Similar categories (LO-PS and LL-PS) have lower values than Highly Similar categories (LO-HS and LL-HS).

Figure 13: AOP-10 distribution of all categories of textual neighbors. SciBERT performs poorly for the LL-PS category (which involves neighbors that scramble sentences, e.g., arranging them randomly, or in increasing or decreasing order of sentence length). For the rest of the categories, SciBERT embeddings show the desirable order of AOP-10 values, e.g., LL-HS > LO-HS. SPECTER has high AOP-10 values, which is desirable, except for the LO-DS category.

Table 4: AOP-10 values for different categories. The best results for the LO-DS category are from OAG-BERT (OB); however, that is because the model performs poorly on all categories of textual neighbors. The best results for the remaining four categories are from SPECTER (SP), which also has a high overlap percentage for the LO-DS category. SciBERT (SB) embeddings perform best for the HS and DS semantic categories but falter on the PS categories.

Table 4
The five textual neighbor categories, arranged in decreasing order of semantic similarity, are: LL-HS ≥ LO-HS > LL-PS > LO-PS > LO-DS. We use heuristic-based values to define optimality. For each of the five categories, we define AOP-20 thresholds to classify whether the textual neighbor representations for the corresponding category are optimal or not. It is expected that AOP-20 values for the semantic categories should follow the order HS > PS > DS, and AOP-20 values for the orthographic categories should follow LL > LO.

Table 6: Candidate documents retrieved for queries 'document vector' and 'document vectors'.
A: Contextual EMBEDDINGS are common in NLP. We introduce a REPRESENTATION for COMPUTER PROGRAMS based on LANGUAGE MODELS. We train deep robust EMBEDDINGS using PYTORCH.

Table 7: Neighbor code is in the format Txx_Ayy_zz, where xx and yy denote the perturbation applied to the paper title (T) and the abstract (A) respectively, and zz denotes a perturbation applied to both T and A. A missing T or A denotes that the corresponding input field is deleted completely.
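For illustration, codes such as T_ADelQ3 and T_A_WS from this schema can be decoded with a small parser. This is entirely hypothetical: the paper gives no machine-readable grammar, and we assume joint codes zz never begin with the letters 'T' or 'A'.

```python
def parse_neighbor_code(code):
    """Decode a Txx_Ayy_zz textual-neighbor code (hypothetical parser).

    Returns a dict with the perturbation for each field:
    'preserve' if the field is listed with no suffix, 'delete' if the
    field is absent, otherwise the suffix string; 'joint' holds zz.
    """
    result = {'T': 'delete', 'A': 'delete', 'joint': None}
    for part in code.split('_'):
        if part in ('T', 'A'):
            result[part] = 'preserve'        # field kept unchanged
        elif part.startswith('T'):
            result['T'] = part[1:]           # per-title perturbation xx
        elif part.startswith('A'):
            result['A'] = part[1:]           # per-abstract perturbation yy
        else:
            result['joint'] = part           # zz applied to both T and A
    return result
```

For example, `parse_neighbor_code('T_ADelQ3')` reads as "title preserved, abstract quantile-3 deletion, no joint perturbation".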

Table 8: Comparison of different transformer-based language models for scientific literature.

Table 9: L2 normalization leads to an incremental improvement in performance. Standardization leads to improvement for the SciBERT and SPECTER models, but the same effect is not observed for OAG-BERT embeddings.
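The two post-processing schemes compared in Table 9 can be sketched as below, assuming standardization uses per-dimension statistics computed over the whole corpus (as in a standard scaler); the function names are ours.

```python
import numpy as np

def l2_normalize(X):
    """Scale each embedding (row) to unit L2 norm."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def standardize(X):
    """Zero-mean, unit-variance scaling of each embedding dimension,
    with mean and std computed across the corpus."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Note that L2 normalization leaves cosine similarities unchanged but makes them proportional to dot products, whereas standardization rescales individual dimensions and can therefore reorder nearest neighbors.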