Impact of Sample Selection on In-Context Learning for Entity Extraction from Scientific Writing



Introduction
Extracting relevant information from scientific documents plays a crucial role in improving methods for organising, indexing, and querying the vast amount of existing literature (Nasar et al., 2018; Weston et al., 2019; Hong et al., 2021). However, annotating datasets for scientific information extraction (IE) is a laborious and costly process that requires domain expertise and the development of annotation guidelines.
In recent years, large language models (LLMs) have demonstrated remarkable performance on various natural language processing (NLP) tasks (Wei et al., 2022; Hegselmann et al., 2023; Ma et al., 2023), including entity extraction from scientific documents (Dunn et al., 2022), as well as leveraging reported scientific knowledge in downstream data science applications (Sorin et al., 2023; Vert, 2023). These models, such as GPT-3 (Brown et al., 2020) and LLaMA (Touvron et al., 2023), with billions of parameters and pre-trained on vast amounts of data, have showcased impressive capabilities to tackle tasks in a zero- or few-shot manner by leveraging in-context learning (ICL) (Radford et al., 2019; Brown et al., 2020).
In ICL, models are provided with a natural language prompt consisting of three components: a format, a set of training samples (input-label pairs, i.e., demonstrations), and a test sentence. The LLM outputs predictions for a given test input without updating its parameters. The main advantage of ICL is its ability to use the pre-existing knowledge of the language model and generalise from a small number of context-specific samples. However, ICL has been shown to be sensitive to the provided samples, and randomly selected samples can introduce significant instability and uncertainty into the predictions (Lu et al., 2021; Chen et al., 2022; Agrawal et al., 2022). This issue can be alleviated by optimising the selection of the in-context samples (Liu et al., 2021; Sorensen et al., 2022; Gonen et al., 2022).
ICL sample selection methods can be divided into two categories: (1) methods for choosing samples from the train set (e.g., the KATE method (Liu et al., 2021)), and (2) methods for finding the best prompts by generating samples (e.g., the Perplexity method (Gonen et al., 2022), SG-ICL (Kim et al., 2022)). These methods can significantly reduce the need for extensive human annotation and allow LLMs to adapt to various domains and tasks.
We rely on the survey of ICL by Dong et al. (2022) and focus on the methods for sample selection within the inference stage of ICL. Our aim is to provide a comprehensive analysis of these methods for selecting samples from the train set as part of ICL for entity extraction from scientific documents. Most of the methods have previously been applied to prompt generation (i.e., to select the best generated sample). Here, we use the methods only for sample selection from the training set of the dataset for entity extraction from scientific documents and compare their effectiveness for this problem. We also propose the use of the Influence method (Koh and Liang, 2017) in an oracle setting, to provide a best-case scenario to compare against. We investigate the in-context sample selection methods (see §3) and evaluate the methods adapted for the entity extraction problem on 5 entity extraction datasets: ADE, MeasEval, SciERC, STEM-ECR, and WLPC, each covering a different scientific subdomain or text modality (see §4.1 for a dataset overview).
Our experiments show that while fully supervised finetuned PLMs are still the gold standard when training data can be sourced, choosing the right samples for ICL can go a long way in improving the effectiveness of ICL for scientific entity extraction (see §5.1). Our experiments demonstrate an improvement potential of 7.56% on average across all experiments when comparing the oracle method (the Influence method) to the random sample selection baseline, and 5.26% when using the best-performing method in a test setting (KATE). Moreover, our evaluations show that our main conclusions hold in a simulated low-resource setting (see §5.2). Finally, our extensive experiments allow us to synthesise some prescriptive advice for other NLP researchers and practitioners tackling scientific entity extraction (see §5.5).

Related Work
By increasing the size of both the model and the corpus, LLMs have demonstrated the capability of ICL, which uses pre-trained language models for new tasks without relying on gradient-based training (Brown et al., 2020). In various tasks, such as inference (ibid.), machine translation (Agrawal et al., 2022), question answering (Huang et al., 2023; Shi et al., 2023), table-to-text generation (Liu et al., 2021), and semantic parsing (An et al., 2023), the ICL use of LLMs introduced by Brown et al. (2020) has been shown to be on par with supervised baselines in terms of effectiveness.
Other studies have found, however, that ICL does not always lead to better results than finetuning. Previous studies investigating ICL for IE are very limited (Gutiérrez et al., 2022; Wan et al., 2023). Gutiérrez et al. (2022) evaluate the performance of ICL on biomedical IE tasks, Named Entity Recognition (NER) and Relation Extraction (RE). In addition, Wan et al. (2023) apply an entity-aware demonstration using the kNN sample selection method (Liu et al., 2021) for RE.
To the best of our knowledge, our work is one of the first attempts to present a comprehensive and detailed analysis of in-context sample selection methods for IE from scientific documents.

Methods
In this section, we describe the ICL sample selection methods for entity extraction from scientific documents. First, we describe the ICL approach in Section 3.1 and then introduce the sample selection methods in Section 3.2.

In-context Learning
Given an LLM, ICL can be used to solve the entity extraction problem for a dataset $D = (X, Y)$, where $X$ are the sentences ($s = w_1, \dots, w_n$) and $Y$ are the entities for each sentence. The prompt $P$ consists of $k$ samples $T$ (where $k$ is the number of samples for few-shot learning), selected from the train set or generated (in this work, we focus only on the former), together with their gold entities ($T_l = (s^{train}_l, e^{train}_l)$ is the $l$-th sample), a format $I$, and a test sentence $s^{test}_i$, so that $P = I + T + s^{test}_i$ (see Appendix B). Prediction is done by selecting the entities with the highest probability for each sentence in the test set.
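To make the prompt structure concrete, the following is a minimal sketch of how $P$ can be assembled for one test sentence. The instruction wording, demonstration layout, and example data are illustrative assumptions, not the exact EasyInstruct template used in our experiments (see Appendix B for the actual format).

```python
# Minimal sketch of ICL prompt assembly (P = I + T + s_test) for entity extraction.
# The instruction text and demonstration layout are illustrative only.

def build_prompt(instruction, demonstrations, test_sentence):
    """Compose a prompt from an instruction, k demonstrations
    (sentence, gold-entities pairs), and one test sentence."""
    parts = [instruction]
    for sentence, entities in demonstrations:
        # Each demonstration shows an input sentence and its gold entities.
        entity_str = "; ".join(f"{span} ({etype})" for span, etype in entities)
        parts.append(f"Sentence: {sentence}\nEntities: {entity_str}")
    parts.append(f"Sentence: {test_sentence}\nEntities:")
    return "\n\n".join(parts)

# Example usage with hypothetical demonstration data:
instruction = "Extract all Drug and Adverse-Effect entities from the sentence below."
demos = [
    ("Peripheral neuropathy associated with capecitabine.",
     [("capecitabine", "Drug"), ("Peripheral neuropathy", "Adverse-Effect")]),
]
print(build_prompt(instruction, demos, "Gemcitabine-induced pulmonary toxicity."))
```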

Sample Selection Methods
We follow the survey on in-context learning (Dong et al., 2022) and choose the following methods to use for sample selection in ICL entity extraction from scientific documents.

KATE (kNN-Augmented in-conText Example selection) is a kNN-based method for selecting the k samples that are closest to the test sample based on sentence embeddings and a distance metric (Euclidean distance or cosine similarity). We follow KATE to select samples from the train set of each dataset for every sentence in the test set.
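A minimal sketch of KATE-style selection, assuming RoBERTa [CLS] embeddings and cosine similarity (the experiments also use OpenAI embeddings and Euclidean distance; see Section 5):

```python
# Sketch of KATE-style kNN sample selection using RoBERTa [CLS] embeddings
# and cosine similarity; model and pooling choices are assumptions consistent
# with the setup described later in the paper.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

@torch.no_grad()
def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # Use the first ([CLS]-equivalent <s>) token representation as the sentence embedding.
    return model(**batch).last_hidden_state[:, 0, :]

def select_knn(test_sentence, train_sentences, k=20):
    train_emb = embed(train_sentences)
    test_emb = embed([test_sentence])
    sims = torch.nn.functional.cosine_similarity(test_emb, train_emb)
    top = sims.topk(k=min(k, len(train_sentences))).indices.tolist()
    return [train_sentences[i] for i in top]
```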
Perplexity is a metric for evaluating language models by calculating the probability of each token given the preceding tokens. The metric provides insight into how unexpected a sentence is under a given language model. Gonen et al. (2022) use perplexity scores of prompts to select the best prompt, rather than selecting examples from the dataset, with prompts synthetically generated through paraphrasing with GPT-3 and back-translation. Unlike Gonen et al. (2022), in our experiments we focus on selecting in-context samples from the training set instead of selecting the better prompt. As a sample selection method, we calculate the perplexity of each train sentence using a language model (LM) and take the k samples from the train set with the lowest perplexity, i.e., the sentences that are most likely and most consistent with the patterns the LM has learned from its training data. Unlike the other in-context sample selection methods (Random, KATE, etc.), the selection of the k samples is independent of the test sentences (i.e., the same samples from the train set are characterised by lower perplexity, independently of the test sample presented alongside).
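A sketch of this test-independent, perplexity-based selection with a masked LM, in the spirit of the pseudo-log-likelihood scoring of Salazar et al. (2019); the exact scoring implementation used in the experiments may differ:

```python
# Sketch of pseudo-perplexity scoring with a masked LM (RoBERTa): mask each token
# in turn and average the negative log-likelihood of the original token.
# Selection is independent of the test sentence.
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")
mlm.eval()

@torch.no_grad()
def pseudo_perplexity(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    nll, n = 0.0, 0
    for pos in range(1, len(ids) - 1):          # skip <s> and </s>
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        logits = mlm(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[ids[pos]].item()
        n += 1
    return math.exp(nll / max(n, 1))

def select_lowest_perplexity(train_sentences, k=20):
    # The same k lowest-perplexity sentences are reused for every test sentence.
    return sorted(train_sentences, key=pseudo_perplexity)[:k]
```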
BM25 is a bag-of-words retrieval model that ranks samples (sentences) from the train set by their relevance to a given test sample (Schütze et al., 2008; Robertson et al., 2009). Similar to retrieval-based methods that augment the input with similar samples from the train set (Xu et al., 2021; Wang et al., 2022b), we select the k most relevant samples from the train set (i.e., those with the highest BM25 scores) for each test sentence in the experiments.
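A minimal sketch with the rank-bm25 package and its default parameters; whitespace tokenisation is a simplifying assumption:

```python
# Sketch of BM25-based sample selection with the rank-bm25 package
# (default k1/b/epsilon, as in the experiments).
from rank_bm25 import BM25Okapi

def select_bm25(test_sentence, train_sentences, k=20):
    tokenized_corpus = [s.lower().split() for s in train_sentences]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(test_sentence.lower().split())
    ranked = sorted(range(len(train_sentences)), key=lambda i: scores[i], reverse=True)
    return [train_sentences[i] for i in ranked[:k]]
```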
Influence functions (Koh and Liang, 2017) were originally used in statistics in the context of linear model analysis (Cook and Weisberg, 1982; Chatterjee and Hadi, 1986; Hampel et al., 1986). The influence method is used in the literature to detect errors in a dataset and to create adversarial training samples (Koh and Liang, 2017). We adapted Influence as a method to study potential performance gains for ICL sample selection because it scores the contribution of a sample to the training process. Similarly to the other in-context sample selection methods, we select the k samples from the train set that have the highest influence on sentences from the test set, using the baseline finetuned RoBERTa model (see Section 4.2) as the model for calculating the loss in the experiments. Since the Influence method's practical applicability is limited (it uses test labels to select the ICL samples via the loss), we use it as a best-case (or oracle) baseline, where the sample ranking is based on training utility rather than a vocabulary similarity signal.
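For reference, the standard formulation of Koh and Liang (2017), where the influence of a training point on the loss at a test point is approximated via the inverse Hessian of the training loss (in practice, the experiments follow the setup of Jain et al. (2022); see Section 5):

```latex
% Influence of a training point z_train on the test loss at z_test (Koh & Liang, 2017)
\mathcal{I}_{\mathrm{up,loss}}(z_{\mathrm{train}}, z_{\mathrm{test}})
  = -\nabla_{\theta} L\!\left(z_{\mathrm{test}}, \hat{\theta}\right)^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_{\theta} L\!\left(z_{\mathrm{train}}, \hat{\theta}\right),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n}\nabla_{\theta}^{2} L\!\left(z_i, \hat{\theta}\right)
```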

Datasets
We evaluate the sample selection methods in ICL for entity extraction from scientific documents. We use 5 datasets from different subdomains: ADE, MeasEval, SciERC, STEM-ECR, and WLPC. Statistical details of the datasets are given in Table 1.

Baseline Methods
In our experiments, we compare ICL sample selection methods with a finetuned pre-trained language model (RoBERTa), zero-shot learning in which no samples are used in the GPT-3.5 prompt, and random sampling in which samples are randomly selected for the prompt.
Finetuned RoBERTa baseline To compare the sample selection methods in ICL against a sensible baseline, we trained an entity extraction model on the datasets using the RoBERTa (Liu et al., 2019) PLM (RoBERTa-base). We formulate the fully supervised task as token-level labelling using BIO tags.
Zero-Shot For the zero-shot setup, we formulate prompts using only the format (I; see Appendix B) and test sentences from the test set of each dataset.

Random Sampling
In this approach, we randomly select k in-context samples from the train set for every test sentence.

Experimental Setup
The baseline RoBERTa PLM is finetuned using Hugging Face (Wolf et al., 2020). For the baseline, zero-shot, random sampling, and ICL sample selection experiments, we build the system using the EasyInstruct (Ou et al., 2023) framework to instruct LLMs for entity extraction from scientific documents with defined entity extraction prompts and the entities of the datasets. In the ICL sample selection experiments, we use a maximum of 20 in-context samples due to the GPT-3.5 (gpt-3.5-turbo-0301) token limit, and 100 sentences from each test set because of the cost of GPT-3.5 usage. The experiment is repeated five times on the test set to calculate the average score and corresponding standard deviation for random sampling (see detailed results in Appendix D).
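For illustration, a minimal sketch of the BIO token-classification finetuning with Hugging Face Transformers; the label set, data fields, and hyperparameters shown here are illustrative assumptions rather than the exact configuration used in the paper:

```python
# Sketch of finetuning roberta-base for BIO token classification with Hugging Face;
# labels, data loading, and hyperparameters are illustrative, not the paper's exact setup.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer, DataCollatorForTokenClassification)

labels = ["O", "B-Drug", "I-Drug", "B-Adverse-Effect", "I-Adverse-Effect"]  # example (ADE)
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("roberta-base", num_labels=len(labels))

def tokenize_and_align(example):
    # `example["tokens"]` (word list) and `example["labels"]` (word-level label ids)
    # are assumed dataset fields; subword pieces beyond the first get label -100.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if w is None else example["labels"][w] for w in enc.word_ids()]
    return enc

args = TrainingArguments(output_dir="ner-roberta", num_train_epochs=5,
                         per_device_train_batch_size=16, learning_rate=2e-5)
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_train,
#                   data_collator=DataCollatorForTokenClassification(tokenizer))
# trainer.train()
```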
For KATE, we use [CLS] token embeddings of the RoBERTa PLM and the OpenAI embedding API (text-embedding-ada-002) to obtain sentence embeddings. We treat the embedding generation method (RoBERTa vs. GPT) as another hyperparameter (much like the number of samples k). We calculate the distance between embeddings using the Euclidean and cosine similarity metrics for each test sentence and select the k most similar sentences based on the distance scores in KATE. We calculate the
perplexity of the samples from the train set using the RoBERTa PLM (following the method outlined by Salazar et al. (2019)) and select the k samples with the lowest perplexity for all test sets of the datasets in the Perplexity method. For BM25, we utilise the rank-bm25 library (https://pypi.org/project/rank-bm25/) with default parameters (term frequency saturation k1 of 1.5, document length normalisation b of 0.75, and the constant ϵ for negative IDF of a sentence in the data of 0.25). We use the finetuned RoBERTa to select the k samples, as defined in the study of Jain et al. (2022), for each test sentence in the Influence method.
As the evaluation metric, we use the entity-level Macro F1 score.

Statistical significance The statistical significance of differences in Macro F1 score is evaluated with an approximate randomisation test (Chinchor, 1992) with 99,999 iterations and significance level α = 0.05, comparing the sample selection methods (KATE, Perplexity, BM25, and Influence) against the supervised RoBERTa baseline model and against random sampling (e.g., Influence vs. RoBERTa and Influence vs. random sampling). For significance testing, we used the run yielding the median entity-level Macro F1 score for the supervised RoBERTa baseline model and for random sampling (i.e., a run close to the mean value reported in the tables).
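A minimal sketch of the approximate randomisation test, assuming a metric function that maps per-sentence predictions to an entity-level Macro F1 score:

```python
# Sketch of the approximate randomisation test (Chinchor, 1992): per-sentence
# predictions of two systems are randomly swapped, and the p-value is the fraction
# of shuffles whose score difference is at least as large as the observed one.
# `metric` is assumed to map a list of per-sentence predictions to Macro F1.
import random

def approx_randomization(preds_a, preds_b, metric, iterations=99_999, seed=0):
    rng = random.Random(seed)
    observed = abs(metric(preds_a) - metric(preds_b))
    at_least_as_extreme = 0
    for _ in range(iterations):
        shuf_a, shuf_b = [], []
        for a, b in zip(preds_a, preds_b):
            if rng.random() < 0.5:       # swap the two systems' outputs for this sentence
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(metric(shuf_a) - metric(shuf_b)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (iterations + 1)   # p-value with add-one smoothing
```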

Main Findings for Selecting In-context Samples
Our main experimental results for entity extraction are given in Table 2, computed on 100 randomly selected sentences from the test set of each dataset (see Section 4.1). Detailed experiments with various numbers of in-context samples k can be found in Appendix D.
Before drilling down into the in-context sample selection methods, we note that the baseline model, RoBERTa, outperforms ICL for entity extraction from scientific documents across all datasets except WLPC, similar to the study of Gutiérrez et al. (2022) conducted on biomedical IE. We obtain the highest entity-level Macro F1 score among sample selection methods for all datasets using the Influence method. Additionally, the performance of sample selection methods is low for the MeasEval, SciERC, and STEM-ECR datasets, and the gap between the results of the finetuned RoBERTa baseline and the Influence method is very large for these datasets. This difference in performance may be due to the difficulty of the datasets (SciERC, STEM-ECR) and the differences between the train and test sets of the datasets (MeasEval) (see Appendix A for a detailed analysis).
The Influence method performs comparably with the RoBERTa model on the ADE dataset. Moreover, despite the complexity of the WLPC dataset with 18 entity types, it is surprising that the effectiveness of zero-shot learning and ICL is better than that of the finetuned RoBERTa model. We hypothesise that this might be due to the method selecting samples from the correct minority classes. Interestingly, a textual similarity signal is almost as good: both BM25 and KATE achieve nearly the same results.

Low-Resource Scenario
To understand how important the size of the training set is for fully supervised finetuning of the baseline PLM (RoBERTa) and for the sample selection methods in ICL, we run the experiments with 1% of the train set to simulate a low-resource scenario. The results can be found in Table 3. Although there is a decrease in the results of ICL for all datasets, it is much less drastic than for the supervised models, which is not surprising: it is well known that a sufficient amount of annotated data is needed to finetune a PLM. Therefore, the robustness of the ICL methods is a valuable finding that can be applied to low-resource problems without annotated data (zero-shot) or with very small train sets (few-shot using selected samples).

Test Set
To understand the impact of the test set in the experiments, we used 3 different randomly sampled test sets. We present the results for the ADE and WLPC datasets (see Appendix C for statistical details of the test sets), where ICL methods perform competitively with the fully supervised baseline. The results can be found in Tables 4 and 5 for ADE and WLPC, respectively. It can be seen that the first test set of the WLPC dataset is challenging for the baseline model, finetuned RoBERTa. However, in-context sample selection methods, with the exception of Perplexity, appear to be less affected by the test set composition and yield similar results across different test sets.

Error Analysis
In Table 6, we give the entity-type-wise entity-level Macro F1 score for the datasets for each ICL method and the baseline models. The detailed error analysis of the Influence method, our oracle method, shows that there are two types of errors in the predictions: (1) correct entity type, wrong entity span, where the model predicts an entity with the correct entity type that is not annotated in the dataset, and (2) wrong entity type, wrong entity span, where the model predicts an entity with a wrong entity type. The visualisation of the 15 sample sentences used for error analysis can be found in Appendix E.
For the ADE dataset, all models perform better for the Drug entity type. The reason may be the shorter entity length (Adverse-Effect: 18.85, Drug: 10.27) and smaller vocabulary (Adverse-Effect: 2,786, Drug: 1,290), although the frequency of Adverse-Effect is higher than that of Drug in the train set and also in the samples selected by each in-context sample selection method. Unlike for the other datasets, we also encounter predictions with entity types that are not present in the ADE dataset (e.g., Disease, Number, Route).
For the MeasEval dataset, the most common error is the mislabelling of spans corresponding to other entity types (Measured Property, Measured Entity, and Qualifier, which are left out in this study) as Quantity entities, e.g., Qualifier as Quantity (a more specific example: 'total counts per gram' predicted as Quantity instead of the correct entity type, Qualifier). Another conclusion from the error analysis for the MeasEval dataset is that GPT-3.5 tends to predict entity spans that are longer than the gold ones (e.g., gold: '11%', predicted: 'axis 2 =11%').
Results on the SciERC dataset show that ICL with sample selection methods struggles in the prediction of less frequent entity types (Generic, Material, Metric, Task) compared to entity types with higher frequency. In particular, Other is the most frequent entity type in the dataset, and GPT-3.5 often extracts a correct span but mislabels it as the Other entity type. In addition, the average sentence length of SciERC is higher than that of the other datasets. However, the number of entities is lower than in the other datasets, and the Influence method tends to retrieve samples with more entities than the dataset average. This results in extracting spans that are not actually entities in the dataset.
For the STEM-ECR dataset, the Influence method is able to extract the correct spans. However, it has difficulty in accurately labelling the spans because the dataset is imbalanced. The frequency of the Material and Process entity types is higher, which leads the Influence method to select samples with these entities and, consequently, to label the extracted entities with these entity types.
Finally, the WLPC dataset is very dense in terms of entities per sentence, despite the sentence length. Since the dataset is imbalanced (the entity types Action, Reagent, Amount, and Location occur more frequently than others), the Influence method retrieves samples covering these entities and, as a result, extracts mainly these entities. Moreover, the dataset is composed of instructional text, and the Action entity is mostly a verb in the sentence, which is easy to extract and label correctly.

Discussion
In practical applications, one may not have enough annotated data to finetune a PLM for a task. In such cases, it might be necessary to use ICL for the problem. Therefore, we explore the performance of the sample selection methods, which can be more effective in this setting. First, we note that the random sampling baseline is also competitive, especially in the low-resource scenario (see Section 5.2).
Among the sample selection methods, we obtain the best results for ADE and WLPC with sentence embeddings coming from the [CLS] token of finetuned RoBERTa (finetuned on the train set of the datasets); for the SciERC, STEM-ECR, and MeasEval datasets, we obtain the best results with OpenAI embeddings for the KATE method. This may be due to an insufficient training set for these tasks, since we use the embeddings from finetuned RoBERTa (which is also used as the baseline model in the study). On the other hand, using OpenAI embeddings in sample selection, despite being costly, avoids the pitfall of needing enough annotated training data to train a supervised model in order to be able to select samples for ICL (although, admittedly, even very under-trained PLMs appear to be effective for sample selection; see further in this section).
We calculated the perplexity of sentences using the pre-trained and the finetuned RoBERTa language models for the Perplexity method, and we obtained better results using the finetuned RoBERTa, which highlights the benefits of domain adaptation of a language model for the entity extraction problem (but, again, points to the issue of needing a decent amount of training data to eventually train a few-shot model). The BM25 method, however, is very simple and effective for each of the datasets, without relying on any finetuned model (or any training, for that matter) for ICL sample selection.
Using these methods for selecting samples from a very limited training set (see Section 5.2) and testing on different test sets (see Section 5.3) shows that the methods are more robust than the baseline model, finetuned RoBERTa. In particular, our experiments in a simulated low-resource setting show that RoBERTa tuned with just 1% of the train set can be used effectively to improve ICL sample selection (e.g., via the KATE method), while performing very poorly on the actual prediction task. This is a valuable finding applicable to subdomains without annotated data or with very limited annotated datasets.
When we analyse the main results (see Table 2) and the results of the low-resource scenario (see Table 3), we find that KATE performs better in a data-poor setup where the number of samples is severely limited. This shows that KATE has a remarkable ability to order a suboptimal subset of in-context samples. This suggests that KATE derives meaningful insights from limited data, making it a valuable method when data scarcity is a challenge. Also, BM25 offers an effective and efficient mechanism for sample selection that can be utilised in a true few-shot setup.
Another observation is that the Influence method, a classic technique from statistics, proves highly effective in selecting samples from a larger pool. The method evaluates the impact of a training sample by assessing its effect on loss, typically the loss of test samples. While it is an oracle method, its high effectiveness highlights a performance gap between a loss-based signal and a sample-similarity-based signal. We believe that bridging this gap is a challenge worth exploring in future research into ICL sample selection methods. However, it should be noted that the effectiveness of Influence decreases in an extreme few-shot setup, possibly due to high training variance caused by a very small number of instances. This, in turn, highlights the robustness of KATE and BM25. BM25, as a keyword-matching method, does not require training (we used default hyperparameters in all experiments). KATE can fall back on a PLM's ability to create text embeddings to overcome the training data scarcity, instead of relying on the loss signal produced with the under-trained layers of the model (i.e., the classification head).

Conclusion
In this paper, we explore in-context sample selection methods for ICL entity extraction from scientific documents. Since entity extraction is a crucial step in IE from scientific documents, we analyse the methods in detail using several datasets from different subdomains and with different entity types. The experimental results show that the baseline model, finetuned RoBERTa, still achieves the best results for this problem on 4 of the 5 datasets. However, the in-context sample selection methods appear to be more robust to train set availability: they achieve similar results to using the full train set when only a small annotated training set is available, yielding significantly better results than the baseline model in this low-resource setup.
Our work aims to extract entity spans using an LLM with ICL. We focus on simple in-context sample selection methods based on similarity, perplexity, relevance, and influence, and use GPT-3.5 as the LLM in ICL. However, there are several alternative LLMs pre-trained on different domains that could be more aligned with the task of scientific entity extraction. As future work, we hope to add a comparative dimension to our work by using these LLMs, since the ICL behaviour of LLMs can change depending on their scale and pretraining. We also plan to explore the performance of in-context sample ordering methods (Lu et al., 2021), which have been shown to impact ICL effectiveness as well.

Limitations
We investigate the impact of ICL sample selection methods for entity extraction from scientific domains. Although we tested several methods on various datasets from different subdomains, due to the high cost of LLMs we limited our experiments to a small subset of the test sets and used only GPT-3.5. Moreover, some of the methods, KATE, Perplexity, and Influence (an oracle method), require finetuned models for better performance in selecting samples from the annotated dataset. In addition, we did not investigate which instruction is most appropriate. We also did not directly investigate the ordering of the selected samples, which has also been shown to impact effectiveness for related NLP problems (Lu et al., 2021; Rubin et al., 2021). Moreover, k is a hyperparameter in few-shot learning that depends on the sample selection method and the dataset. We tested directly on the test set without using a validation set. Finally, we did not apply contextual calibration (Zhao et al., 2021) for entity extraction, which has been shown to improve the performance of in-context learning for NLP tasks, and keep this as future work.

A Dataset Analysis

entity types for ICL methods. It can also be seen that the TC values of the SciERC and STEM-ECR datasets are higher than those of the other datasets.
In addition to the difficulty metrics, the TVC similarity metric calculates the similarity of the tokens in the training and test datasets and shows that the MeasEval test set is less similar to its train set compared to the other datasets.

B Prompt Template
For the experiments, we use the prompt format (I) of the EasyInstruct framework defined for the Named Entity Recognition (NER) task. The prompt used in zero-shot and few-shot learning is given in Figure 1, together with an illustration of ICL for entity extraction.

C Test Set Details
Test set details used in Section 5.3 are given in Table 8.

D In-Context Learning Experiments
The experimental results with various numbers of in-context samples k, conducted on 100 sentences, can be found in Table 9.

E Visualization of Entities
The visualization of errors made by the Influence method, together with the gold entities, for 15 sentences is given in Tables 10, 11, 12, 13 and 14 for the ADE, MeasEval, SciERC, STEM-ECR, and WLPC datasets, respectively. We use different colours (other than green) to highlight the entity types; in the Influence method's predictions, we highlight in green the wrongly extracted or wrongly labelled entities, as well as the wrong entity type even when the extracted span is correct.

S1 (Gold): Gemcitabine [Drug]-induced pulmonary toxicity [AE] is usually a dramatic condition.
S1 (Influence): Gemcitabine [Drug]-induced pulmonary toxicity [AE] is usually a dramatic condition.

S2 (Gold): Peripheral neuropathy [AE] associated with capecitabine [Drug].
S2 (Influence): Peripheral neuropathy [AE] associated with capecitabine [Drug].

S3 (Gold): Two cases of mequitazine [Drug]-induced photosensitivity reactions

S4 (Gold): When a segment is found to be an NE items [O], this information is added to the segment and it is used to generate the final output.
S4 (Influence): When a segment [O] is found to be an NE items [O], this information is added to the segment and it is used to generate [Method] the final output [Generic].

S5 (Gold): Requestors can also instruct the system [Generic] to notify them when the status of a changes or when a request is complete.
S5 (Influence): Requestors can also instruct the system

S11 (Gold): The request is passed to a mobile, intelligent agent [Method] for execution at the appropriate database.
S11 (Influence): The request [Task] is passed to a mobile [Generic], for execution [Task] at the appropriate database [Generic].

S12 (Gold): Each part is a collection of salient image features [O].
S12 (Influence): Each part is a collection of salient image features.

S13 (Gold): We have conducted numerous simulations to verify the practical feasibility of our algorithm [Generic].
S13 (Influence): We have conducted numerous simulations [Generic] to verify the practical feasibility of our algorithm

Table 1: Statistical details of the datasets. Avg_e is the average length of entities and Avg_s is the average length of sentences.
The aim of the functions is to calculate the influence of a training sample $s^{train}$ on a test sample $s^{test}$, formulated as the change in the loss on $s^{test}$ if the training sample $s^{train}$ were removed from training. This yields the influence of $s^{train}$ on solving the task for $s^{test}$.

Table 3: Main results for methods of selecting in-context samples using 1% of the train set. The best results are given in bold. The best results among the in-context sample selection methods are underlined.

Table 4: Results for different test sets for the ADE dataset. The best results are given in bold. The best results among the in-context sample selection methods are underlined.

Table 5: Results for different test sets for the WLPC dataset. The best results are given in bold. The best results among the in-context sample selection methods are underlined.

Table 6: Entity-type-wise results of each in-context sample selection method and the baseline models.

Table 7: Difficulty and similarity scores of the datasets.

Table 8: Statistical details of the test sets used in Section 5.3. Avg_e is the average length of entities and Avg_s is the average length of sentences.

Table 10: Selected sentences from the test set with gold and predicted entities for the ADE dataset. AE is the abbreviation of Adverse-Effect.

[Generic] in two stages [Generic]: dictionary lookup [Generic] and rule application [O].

Table 12: Selected sentences from the test set with gold and predicted entities for the SciERC dataset. O is the abbreviation of Other.
Data is free of Christoffel symbols [Material] has been predicted [Process].

Table 13: Selected sentences from the test set with gold and predicted entities for the STEM-ECR dataset.
[Reagent] and mix [Action] by inverting [Action] the tube [Action] the tube [Device] 10 [Numerical] times to precipitate [Action] the DNA

Table 14: Selected sentences from the test set with gold and predicted entities for the WLPC dataset.