TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics

Tasks, Datasets and Evaluation Metrics are important concepts for understanding experimental scientific papers. However, previous work on information extraction for scientific literature mainly focuses on abstracts only and does not treat datasets as a separate type of entity (Zadeh and Schumann, 2016; Luan et al., 2018). In this paper, we present a new corpus that contains domain expert annotations for Task (T), Dataset (D) and Metric (M) entities in 2,000 sentences extracted from NLP papers. We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL Anthology. The corpus is made publicly available to the community to foster research on scientific publication summarization (Erera et al., 2019) and knowledge discovery.


Introduction
The past years have witnessed significant growth in the number of scientific publications and benchmarks in many disciplines. As an example, in the year 2019 alone, more than 170k papers were submitted to the pre-print repository arXiv, and among them close to 10k papers were classified as NLP papers (i.e., cs.CL). Each experimental scientific field, including NLP, will benefit from this massive increase in studies, benchmarks, and evaluations, as they can provide ingredients for novel scientific advancements.
However, researchers may struggle to keep track of all studies published in a particular field, resulting in duplicated research, comparisons with old or outdated benchmarks, and a lack of progress. To tackle this problem, there have recently been a few manual efforts to summarize the state-of-the-art in selected subfields of NLP in the form of leaderboards that extract tasks, datasets, metrics and results from papers, such as NLP-progress or paperswithcode. But these manual efforts are not sustainable over time for all NLP tasks.
Over the past few years, several studies and shared tasks have begun to tackle entity extraction from scientific papers. Augenstein et al. (2017) formalized a task to identify three types of entities (i.e., task, process, material) in scientific publications (SemEval 2017 task 10). Gábor et al. (2018) presented a task (SemEval 2018 task 7) on semantic relation extraction from NLP papers. They provided a dataset of 350 abstracts and reused the entity annotations from Zadeh and Schumann (2016). Recently, Luan et al. (2018) released a corpus containing 500 abstracts with six types of entity annotations. However, these corpora do not treat Dataset as a separate type of entity, and most of them focus on abstracts only.
In a previous study, we developed an IE system to extract {task, dataset, metric} triples from NLP papers based on a small, manually created task/dataset/metric (TDM) taxonomy (Hou et al., 2019). In practice, we found that a TDM knowledge base is required to extract TDM information and build NLP leaderboards for a wide range of NLP papers. This can help researchers quickly understand the related literature for a particular task, or perform comparable experiments.
As a first step towards building such a TDM knowledge base for the NLP domain, in this paper we present a specialized English corpus containing 2,000 sentences taken from the full text of NLP papers, annotated by domain experts for three main concepts: Task (T), Dataset (D) and Metric (M). Based on this corpus, we develop a TDM tagger using a novel data augmentation technique. In addition, we apply this tagger to around 30,000 NLP papers from the ACL Anthology and demonstrate its value for constructing an NLP TDM knowledge graph. We release our corpus at https://github.com/IBM/science-result-extractor.

Related Work
A lot of interest has been focused on information extraction from scientific literature. SemEval 2017 task 10 (Augenstein et al., 2017) proposed a new task for the identification of three types of entities (Task, Process, and Material) in a corpus of 500 paragraphs taken from open access journals. Based on Augenstein et al. (2017) and Gábor et al. (2018), Luan et al. (2018) created SciERC, a dataset containing 500 scientific abstracts with annotations for six types of entities and the relations between them. Neither SemEval 2017 task 10 nor SciERC treats "dataset" as a separate entity type. Instead, their "material" category comprises a much larger set of resource types, including tools, knowledge resources, and bilingual dictionaries, as well as datasets.
In our work, we focus on "dataset" entities that researchers use to evaluate their approaches, because datasets are one of the three core elements needed to construct leaderboards for NLP papers.
Concurrent to our work, Jain et al. (2020) developed SciREX, a new corpus which contains 438 papers from different domains collected from paperswithcode. It includes annotations for four types of entities (i.e., Task, Dataset, Metric, Method) and the relations between them. The initial annotations were carried out automatically using distant signals from paperswithcode; human annotators then performed the necessary corrections to generate the final dataset. SciREX is the closest to our corpus in terms of entity annotations. In our work, we focus on TDM entities which reflect the collectively shared views in the NLP community, and our corpus is annotated by five experts who all have 5-10 years of NLP research experience.

Annotation Scheme
We developed an annotation scheme for annotating Task, Dataset, and Evaluation Metric phrases in NLP papers. Our annotation guidelines (see the appendix for the whole annotation scheme) are based on the scientific term annotation scheme described in Zadeh and Schumann (2016). Different from previous corpora (Zadeh and Schumann, 2016; Luan et al., 2018), we only annotate factual and content-bearing entities. This is because we aim to build a TDM knowledge base in the future, and non-factual entities (e.g., a high-coverage sense-annotated corpus in Example 1) do not reflect the collectively shared views of TDM entities in the NLP domain.
(1) In order to learn models for disambiguating a large set of content words, a high-coverage senseannotated corpus is required.
Following the above guidelines, we also do not annotate anonymous entities, such as "this task" or "the dataset". These entities are anaphors and cannot be used independently to refer to specific TDM entities without context. In general, we choose to annotate TDM entities that normally have specific names and whose meanings are usually consistent across different papers. From this perspective, the TDM entities that we annotate are similar to named entities, which are self-sufficient to identify their referents.

Pilot Annotation Study
Data preparation. For the pilot annotation study, we chose 100 sentences from the NLP-TDMS corpus (Hou et al., 2019). The corpus contains 332 NLP papers which are annotated with {Task, Dataset, Metric} triples at the document level. We used string and substring matching to extract a list of sentences from these papers which are likely to contain the document-level Task, Dataset, and Metric annotations. We then manually chose 100 sentences from this list following three criteria: 1) each sentence should contain a valid mention of a Task, Dataset, or Metric; 2) the sentences should come from different papers as much as possible; and 3) there should be a balanced distribution of task, dataset, and metric mentions in these sentences.
Annotation agreement. Four NLP domain experts annotated the same 100 sentences for a pilot annotation study, following the annotation guidelines described above. All annotations were conducted using BRAT (Stenetorp et al., 2012). Inter-annotator agreement was calculated with a pairwise comparison between annotators using precision, recall and F-score on the exact match of the annotated entities: two entities are considered matching (true positive) if they have the same boundaries and are assigned the same label. We also calculate Fleiss' kappa on a per-token basis, comparing the agreement of annotators on each token in the corpus. Table 1 lists the mean F-score as well as the token-based Fleiss' κ value for each entity type. Overall, we achieve high reliability for all categories.
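The token-level Fleiss' κ can be computed directly from the aligned per-token label assignments of the four annotators. Below is a minimal sketch of that computation (not the authors' actual script); the label sets and example matrix are illustrative.

```python
from collections import Counter

def fleiss_kappa(label_matrix):
    """Fleiss' kappa over a list of per-token label lists.

    label_matrix: one row per token, one label per annotator,
    e.g. [["B-Task", "B-Task", "O", "B-Task"], ...]
    """
    n_items = len(label_matrix)
    n_raters = len(label_matrix[0])
    # per-item agreement: P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    p_bar = 0.0
    cat_totals = Counter()
    for row in label_matrix:
        counts = Counter(row)
        cat_totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= n_items
    # chance agreement from overall category proportions
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in cat_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# four annotators agreeing perfectly on three tokens -> kappa = 1.0
print(fleiss_kappa([["B-Task"] * 4, ["O"] * 4, ["B-Metric"] * 4]))
```

The pairwise F-scores use the same exact-match criterion as the tagger evaluation, treating one annotator's entities as gold and the other's as predictions.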
Adjudication. The final step of the pilot annotation was to reconcile disagreements among the four annotators to produce the final canonical annotation. This step also allowed us to refine the annotation guidelines: through the discussion of annotation disagreements we could identify ambiguities and omissions in the guidelines. For example, one point of ambiguity was whether a task must be associated with a dataset, or whether we can annotate higher-level tasks, e.g., sequence labeling, which do not have a dedicated dataset but may include several tasks and datasets. This discussion also revealed the overlap in how tasks and datasets are referred to in the literature. As authors we frequently use these interchangeably, often with shared tasks: e.g., "SemEval-07 task 17" seems to more often refer to a dataset than to a specific instance of the (Multilingual) Word Sense Disambiguation task, and the "MultiNLI" corpus is sometimes used as shorthand for the task. After the discussion, we agreed that we should annotate higher-level tasks. In addition, we should assign labels to entities according to their actual referential meanings in context.

Main Annotation
After the pilot study, 1,900 additional sentences were annotated by five NLP researchers. Four of the annotators had participated in the pilot annotation study, and all annotators joined the adjudication discussion. Note that every annotator annotated a different set of sentences: the annotator who designed the annotation scheme annotated 700 sentences, and the other four annotators annotated 300 sentences each. In general, most sentences in our corpus are not from abstracts. The goal of developing our corpus is to automatically build an NLP TDM taxonomy and use it to tag NLP papers. Therefore, the inclusion of sentences from the whole paper rather than only the abstract section is important for our purpose, because not all abstracts mention all three elements. For instance, for the top ten papers listed in the {sentiment analysis, IMDB, accuracy} leaderboard on paperswithcode, only four abstracts mention the dataset "IMDB". If we only focused on abstracts, we would miss the other six papers from the leaderboard.

A TDM Entity Tagger
Our final corpus TDMSci contains 2,000 sentences with 2,937 mentions of the three entity types. We convert the original BRAT annotations to the standard CoNLL format using the BIO scheme. We develop a tagger to extract TDM entities based on this corpus.
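The conversion from BRAT's character-offset standoff annotations to token-level BIO tags can be sketched as follows. The tokenisation and span offsets below are hypothetical examples, not taken from TDMSci.

```python
def brat_to_bio(tokens, spans):
    """Map character-offset entity spans onto tokens as BIO tags.

    tokens: list of (text, start, end) character offsets.
    spans:  list of (label, start, end) as read from a BRAT .ann file,
            e.g. ("Dataset", 19, 32).
    """
    tags = ["O"] * len(tokens)
    for label, s_start, s_end in spans:
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            # token fully inside the annotated span
            if t_start >= s_start and t_end <= s_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

sent = "We evaluate on the Penn Treebank ."
# hypothetical whitespace tokenisation with character offsets
tokens, pos = [], 0
for tok in sent.split():
    start = sent.index(tok, pos)
    tokens.append((tok, start, start + len(tok)))
    pos = start + len(tok)
print(brat_to_bio(tokens, [("Dataset", 19, 32)]))
# ['O', 'O', 'O', 'O', 'B-Dataset', 'I-Dataset', 'O']
```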

Experimental Setup
To evaluate the performance of our tagger, we split TDMSci into training and testing sets, which contain 1,500 and 500 sentences, respectively. Table 2 shows the statistics of task/dataset/metric mentions in these two datasets. For evaluation, we report precision, recall, and F-score on exact match for each entity type, as well as micro-averaged precision, recall, and F-score for all entities.
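Under exact match, a predicted entity counts as a true positive only if both its boundaries and its label match a gold entity. A minimal sketch of this scoring, with hypothetical spans encoded as (label, start, end):

```python
def span_prf(gold, pred):
    """Exact-match precision/recall/F1 over (label, start, end) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("Task", 0, 2), ("Dataset", 4, 6)}
pred = {("Task", 0, 2), ("Dataset", 4, 7)}  # boundary mismatch on the second
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

The micro-averaged score is obtained the same way, pooling the spans of all three entity types before counting true positives.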

Models
We model the task as a sequence tagging problem.
We apply a traditional CRF model (Lafferty et al., 2001) based on lexical features and a BiLSTM-CRF model for this task. To compare with the state-of-the-art entity extraction model on scientific literature, we also use SciIE from Luan et al. (2018) to train a TDM entity recognition model based on our training data. Below we describe all models in detail.

CRF. We use the Stanford CRF implementation (Finkel et al., 2005) to train a TDM NER tagger based on our training data. We use the following features: unigrams of the previous, current and next words; current word character n-grams; current POS tag; surrounding POS tag sequence; current word shape; and surrounding word shape sequence.
CRF with gazetteers. To test whether the above CRF model can benefit from knowledge resources, we add two gazetteers to the feature set: one is a list of around 6,000 dataset names crawled from the LRE Map, and the other comprises around 30 common evaluation metrics compiled by the authors.
SciIE. Luan et al. (2018) proposed a multi-task learning system to extract entities and relations from scientific articles. SciIE is based on span representations using ELMo (Peters et al., 2018), and here we adapt it for TDM entity extraction. Note that if SciIE predicts several embedded entities, we keep the one that has the highest confidence score; in practice, we notice that this does not happen in our corpus.
Flair-TDM. For the BiLSTM-CRF model, we use the recent Flair framework (Akbik et al., 2018) based on the cased BERT-base embeddings (Devlin et al., 2018). We train our Flair-TDM model with a learning rate of 0.1, a batch size of 32, a hidden size of 768, and a maximum of 150 epochs.

Data Augmentation
For TDM entity extraction, we expect the surrounding context to play an important role. For instance, in the sentence "we show that for X on the Y, our model outperforms the prior state-of-the-art", one can easily guess that X is a task entity while Y is a dataset entity. We therefore propose a simple data augmentation strategy that generates additional masked training data by replacing every token within an annotated TDM entity with UNK.
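A minimal sketch of this masking step as we read it: each annotated sentence yields a masked copy in which every entity token becomes UNK, and the tagger is trained on the union of the original and masked sentences. The tokens and tags below are illustrative.

```python
def mask_entities(tokens, tags, unk="UNK"):
    """Return a masked copy of a BIO-tagged sentence: every token inside
    an annotated entity is replaced by the UNK placeholder, so a model
    trained on it must rely on the surrounding context."""
    return [unk if tag != "O" else tok for tok, tag in zip(tokens, tags)]

tokens = ["We", "evaluate", "POS", "tagging", "on", "the", "Penn", "Treebank"]
tags   = ["O", "O", "B-Task", "I-Task", "O", "O", "B-Dataset", "I-Dataset"]
print(mask_entities(tokens, tags))
# ['We', 'evaluate', 'UNK', 'UNK', 'on', 'the', 'UNK', 'UNK']
```

The masked copy keeps the original tag sequence, so the model learns to predict entity labels from context alone.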

Results and Discussion
Table 3 shows the performance of the different models for task/dataset/metric entity recognition on our testing dataset. First, although adding gazetteers helps the CRF model detect dataset and metric entities better, the positive effect is limited. In general, both SciIE and Flair-TDM perform better than the CRF models for detecting all three types of entities.
Second, augmenting the original training data with the additional masked data described in Section 4.3 further improves the performance of both SciIE and Flair-TDM. However, this is not the case for the CRF models; we assume this is because the CRF models depend heavily on lexical features.
Finally, we randomly sampled 100 sentences from the testing dataset and compared the TDM entities predicted by Flair-TDM against the gold annotations. We found that most errors come from boundary mismatches for task and dataset entities, e.g., text summarization vs. abstractive text summarization. Another source of errors is bias in the training data: for instance, authors more often use "Penn Treebank" than "Penn Treebank dataset" to refer to the dataset, so the model learns this bias and only tags "Penn Treebank" as the dataset even when, in a specific testing sentence, "Penn Treebank dataset" was used to refer to the same corpus.

In general, we think these mismatched predictions are reasonable in the sense that they capture the main semantics of the referents. Note that the numbers reported in Table 3 are based on exact match. Sometimes requiring an exact match may be too restrictive for downstream tasks. Therefore, we carried out an additional evaluation of the best Flair-TDM model using the partial match criteria from SemEval 2013 Task 9 (Segura-Bedmar et al., 2013), which gives a micro-averaged F1 of 76.47 for type partial match.

An Initial TDM Knowledge Graph
In this section, we apply the Flair-TDM tagger to around 30,000 NLP papers from the ACL Anthology to build an initial TDM knowledge graph.
We downloaded all NLP papers from 1974 to 2019 that belong to the ACL from the ACL Anthology. For each paper, we collect sentences from the title, the abstract/introduction/dataset/corpus/experiment sections, as well as from the table captions. We then apply the Flair-TDM tagger to these sentences. Based on the tagger results, we build an initial graph G using the following steps:
• add a TDM entity as a node into G if it appears at least five times in more than one paper;
• create a link between a task node and a dataset/metric node if they appear in the same sentence at least five times in different papers.
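The two thresholded steps above can be sketched as follows, assuming the tagger output has been grouped into per-paper lists of per-sentence entity lists (the data layout and names here are assumptions, not the authors' implementation):

```python
from collections import Counter
from itertools import product

def build_tdm_graph(papers, min_count=5):
    """papers: one entry per paper; each entry is a list of sentences,
    and each sentence is a list of (label, text) predicted entities."""
    node_counts, node_papers = Counter(), {}
    edge_counts, edge_papers = Counter(), {}
    for pid, sentences in enumerate(papers):
        for ents in sentences:
            for label, text in ents:
                node = (label, text)
                node_counts[node] += 1
                node_papers.setdefault(node, set()).add(pid)
            # link a task with each dataset/metric in the same sentence
            tasks = [t for l, t in ents if l == "Task"]
            others = [(l, t) for l, t in ents if l in ("Dataset", "Metric")]
            for task, (l, other) in product(tasks, others):
                edge = (task, l, other)
                edge_counts[edge] += 1
                edge_papers.setdefault(edge, set()).add(pid)
    # keep nodes/edges seen at least `min_count` times in more than one paper
    nodes = {n for n, c in node_counts.items()
             if c >= min_count and len(node_papers[n]) > 1}
    edges = {e for e, c in edge_counts.items()
             if c >= min_count and len(edge_papers[e]) > 1}
    return nodes, edges
```

With this layout, an SRL paper sentence containing ("Task", "SRL") and ("Dataset", "PropBank") contributes one count to each node and one to the SRL-PropBank link.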
By applying this simple process, we obtain a noisy TDM knowledge graph containing 180k nodes and 270k links. After checking a few dense areas, we find that our graph encodes valid knowledge about NLP tasks/datasets/metrics. Figure 1 shows that in our graph, the task "SRL" (semantic role labelling) is connected to datasets such as "FrameNet", "PropBank", and "NomBank", which are standard benchmark datasets for this task.
Based on the tagged ACL Anthology and this initial noisy graph, we are exploring various methods to build a large-scale NLP TDM knowledge graph and to evaluate its accuracy and coverage in ongoing work.

Conclusion
In this paper, we have presented a new corpus (TDMSci) annotated for three important concepts (Task/Dataset/Metric) that are necessary for extracting the essential information from an NLP paper. Based on this corpus, we have developed a TDM tagger using a simple but effective data augmentation strategy. Experiments on 30,000 NLP papers show that our corpus together with the TDM tagger can help build TDM knowledge resources for the NLP domain.
• Exclude the head noun 'task/problem' when annotating a task (e.g., only annotate "link prediction" for "the link prediction problem"), unless it is an essential part of the task name itself (e.g., CoNLL-2012 shared task, SemEval-2010 relation classification task).
(3) In order to learn models for disambiguating a large set of content words, a high-coverage sense-annotated corpus is required.

Figure 1: A subset of the TDM graph.

Table 1: Inter-annotator agreement.

Table 2: Statistics of task/dataset/metric mentions in the training and testing datasets.

Table 3: Results of the different models for task/dataset/metric entity recognition on the TDMSci test dataset.

Task and Dataset entities can contain other entities (see Table 4, row 12). If both the full name and the abbreviation are present in the sentence, annotate the abbreviation together with its corresponding full name. For instance, we annotate "20-newsgroup (20NG)" as a dataset entity in Example 2. Factual entity. Only annotate "factual, content-bearing" entities. Task, dataset, and metric entities normally have specific names and their meanings are consistent across different papers. In Example 3, "a high-coverage sense-annotated corpus" is not a factual entity.

Table 4: Examples of entity span annotation guidelines.