SemEval-2021 Task 11: NLPContributionGraph - Structuring Scholarly NLP Contributions for a Research Knowledge Graph

There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPContributionGraph (a.k.a. ‘the NCG task’) asks participants to develop automated systems that structure contributions from NLP scholarly articles in the English language. As the first of its kind in the SemEval series, the task released structured data from NLP scholarly articles at three levels of information granularity, i.e. at sentence-level, phrase-level, and phrases organized as triples toward Knowledge Graph (KG) building. The sentence-level annotations comprised the few sentences about the article’s contribution. The phrase-level annotations were scientific term and predicate phrases from the contribution sentences. Finally, the triples constituted the research overview KG. For the Shared Task, participating systems were then expected to automatically classify contribution sentences, extract scientific terms and relations from the sentences, and organize them as KG triples. Overall, the task drew a strong participation demographic of seven teams and 27 participants. The best end-to-end task system classified contribution sentences at 57.27% F1, phrases at 46.41% F1, and triples at 22.28% F1. While the absolute performance in generating triples remains low, the article concludes by highlighting the difficulty of producing such data and, as a consequence, of modeling it.


Introduction
Traditional search models over scholarly communication are now changing toward Knowledge Graph (KG) models operating on structured fine-grained scholarly content offering enhanced contextual search results. Several initiatives exist to this end: Google Scholar, Web of Science (Birkle et al., 2020), Microsoft Academic Graph (Wang et al., 2020), OpenAIRE Research Graph (Manghi et al., 2019), Open Research Knowledge Graph (Auer, 2018), and Semantic Scholar (Fricke, 2018), to name just a few. These KG models differ in their content, their level of detail, etc., as they represent diverse aspects of scholarly communication.
Text, of course, is of seminal importance to Science. It is as important as experimentation itself; unpublished research lacks validity. Seen from another angle, it is hard to imagine a medium other than discourse that can convey a comprehensive picture of the scholarly investigation. For the wider research audience, it is interesting to read the full "stories" of Science.
Nonetheless, since scientific literature is growing at a rapid rate (Johnson et al., 2018) and researchers today are faced with this publication deluge (Landhuis, 2016), it is increasingly tedious, if not practically impossible, to keep up with the research progress even within one's own narrow discipline. In this regard, among the existing scholarly knowledge structuring initiatives, the Open Research Knowledge Graph (ORKG) (Auer et al., 2020) is posited as a solution to the problem of keeping track of research progress minus the cognitive overload that reading dozens of full papers imposes. It aims to build a comprehensive KG that publishes the research contributions of scholarly publications per paper, where the contributions are interconnected via the graph even across papers. The ORKG digital library (DL) framework can be accessed at https://www.orkg.org.
Motivated by the availability of a next-generation DL, we present the SemEval-2021 NLPCONTRIBUTIONGRAPH (NCG) Shared Task as a step in the easier knowledge acquisition of contri-

Data Annotation Scheme
A trial annotation stage preceded the annotation of the Shared Task dataset. In this stage, an annotation scheme was prescribed. This involved specifying the annotation data granularities and the 12 IUs for organizing the triples. Observations were also obtained about the position in the articles where the authors generally stated the contribution. The trial annotations were conducted in two steps: a pilot annotation step (D'Souza and Auer, 2020) followed by an adjudication step (D'Souza and Auer, 2021). The resulting scheme itself was called the NLPCONTRIBUTIONGRAPH (NCG) scheme.
For the trial stage, a relatively small dataset of 50 articles uniformly distributed across five NLP tasks, i.e. machine translation, named entity recognition, question answering, relation classification, and text classification, was selected.
Overall, after the pilot annotation task the following core question was answered: could a scheme be defined such that it would encompass all annotation decisions of the task? In reality, it was found that the scheme could only define high-level annotation decisions such as where in the article the contribution information could generally be found, e.g., the title, the abstract, a few lines in the Introduction, or the first few lines of the Results section. This still entailed making subjective decisions; for instance, if the model was not described in the Introduction, then the first few lines of the model description section would need to be annotated. The scheme also specified the 12 IUs for organizing the structured triples. The choice of the specific IU for organizing the triples was based on the closest section title.
After the two-step trial annotation stage, the intra-annotation agreement between the pilot and adjudication steps, in terms of F1, was 67.92% for sentences, 41.82% for phrases, and 22.31% for triple statements, indicating that with increased granularity of the information, the annotation adjudication was greater (2021).
The trial annotations were made by a postdoctoral researcher in Computational Linguistics. The same experienced annotator also annotated the full dataset. Next, we explain the NCG data with a focus on the KG and then offer two supporting examples as illustrations of the data.

Understanding our Knowledge Graph
The NCG KG used two levels of knowledge systematization: 1) at the root, it defined a dummy node called CONTRIBUTION; and 2) following the root node, it defined the 12 nodes introduced earlier and generically referred to as Information Units or IUs. Each scholarly article's annotated contribution triple statements were organized under three (mandatory) or more of these IU nodes, depending on whether they applied to the article. Next, we provide details about each IU.
RESEARCHPROBLEM The research challenge addressed by a contribution. In other words, a focus of the research investigation or the issue for which a research solution was proposed.

APPROACH or MODEL
The contribution of the paper as the solution proposed for the research problem. This unit was called APPROACH when the solution was proposed as an abstraction, and was called MODEL if the solution was proposed in practical implementation terms. Further, in case the solution was not referred to as approach or model in the article, the reference was normalized as either APPROACH or MODEL. E.g., references like "method" or "application" were normalized as APPROACH; on the other hand, references like "system" or "architecture" were normalized to MODEL. This unit captured only proposed system highlights.
CODE The contribution resource; the link to the software on an open-source hosting platform such as Gitlab or Github or on the author's website.
DATASET Like CODE, this is a contributed resource in the form of a dataset.
EXPERIMENTALSETUP or HYPERPARAMETERS Details about the platform, including both hardware (e.g., GPU) and software (e.g., Tensorflow library), for implementing the machine learning solution; and of variables that determine the network structure (e.g., number of hidden units) and how the network is trained (e.g., learning rate), for tuning the software to the task objective. It was called EXPERIMENTALSETUP only when hardware details were provided.
BASELINES The systems that a proposed APPROACH or MODEL was compared with.

RESULTS
The main findings or outcomes reported in an article for the RESEARCHPROBLEM.
TASKS When the APPROACH or MODEL, particularly in multi-task settings, was tested on more than one task, this unit was defined to capture all the experimental tasks. Unlike the earlier units, the TASKS IU was a container for more than one of the earlier mentioned IUs. Specifically, each task listed in TASKS could include one or more of the EXPERIMENTALSETUP, HYPERPARAMETERS, and RESULTS as sub-information units.
Furthermore, since it is common in NLP for tasks to be defined over datasets, experimental tasks are often synonymous with the experimental datasets; therefore, this unit was also applied in articles where the datasets were explicitly listed instead of the task names.
EXPERIMENTS The second container information unit, like TASKS, defined to include one or more of the previously discussed units as sub-information units. This unit encapsulated several TASKS themselves and, consequently, the units that TASKS encapsulated, i.e. EXPERIMENTALSETUP and RESULTS, or a combination of APPROACH, EXPERIMENTALSETUP and RESULTS.
ABLATIONANALYSIS A form of RESULTS that describes the performance of components in an APPROACH or MODEL.
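The unit definitions above can be summarized as a small lookup structure. The following is a minimal sketch, not the organizers' implementation: the constant and function names are hypothetical, and only the labels, the container relations, and the four alias examples ("method", "application", "system", "architecture") come from the text.

```python
# The 12 Information Units (IUs) named in this section.
INFORMATION_UNITS = (
    "RESEARCHPROBLEM", "APPROACH", "MODEL", "CODE", "DATASET",
    "EXPERIMENTALSETUP", "HYPERPARAMETERS", "BASELINES", "RESULTS",
    "TASKS", "EXPERIMENTS", "ABLATIONANALYSIS",
)

# Container units and the sub-units the text says they may hold.
CONTAINER_UNITS = {
    "TASKS": {"EXPERIMENTALSETUP", "HYPERPARAMETERS", "RESULTS"},
    "EXPERIMENTS": {"TASKS", "APPROACH", "EXPERIMENTALSETUP", "RESULTS"},
}

# How an article's own wording for its solution maps to APPROACH or MODEL.
# Only the examples given in the text are listed; anything else is unknown.
SOLUTION_ALIASES = {
    "approach": "APPROACH",
    "method": "APPROACH",
    "application": "APPROACH",
    "model": "MODEL",
    "system": "MODEL",
    "architecture": "MODEL",
}


def normalize_solution_unit(reference: str) -> str:
    """Return APPROACH or MODEL for a solution reference found in an article."""
    key = reference.strip().lower()
    if key not in SOLUTION_ALIASES:
        raise KeyError(f"no normalization rule known for {reference!r}")
    return SOLUTION_ALIASES[key]


print(normalize_solution_unit("Method"))        # APPROACH
print(normalize_solution_unit("architecture"))  # MODEL
```

Raising on unknown wording, rather than guessing, reflects that the scheme describes these normalizations only by example.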

Data Examples
Below, we show two examples of two different IUs, viz. RESEARCHPROBLEM and MODEL, as illustrations of our data.
The first example is the RESEARCHPROBLEM IU annotated for the paper "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (Cho et al., 2014). We show two formats of our data: the JSON format (see Fig. 1) with all three annotated information granularities; and the triples format (see Table 1) showing only the annotated data for a KG. In the JSON data, the dummy root node CONTRIBUTION is left unspecified; however, it is specified in the triples. For this data, three phrases that named the research problem were annotated. The phrases were attached to the dummy root node by the predicate "has research problem." Further, in the JSON data, following the predicate "from sentence," the selected contribution sentences are listed.
The second example is the MODEL IU annotated for the paper "Convolutional Neural Network Architectures for Matching Natural Language Sentences" (Hu et al., 2014). See Fig. 2 for the JSON format and Table 2 for the triples data.
Table 2: Annotated MODEL Information Unit contribution data as triples. This data is obtained from the JSON data illustrated in Fig. 2.
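To make the relation between the two data formats concrete, the sketch below derives RESEARCHPROBLEM triples from a JSON-like record. The record layout, the placeholder phrases, and the function name are assumptions for illustration; only the root node CONTRIBUTION and the predicates "has research problem" and "from sentence" are taken from the text.

```python
ROOT_NODE = "Contribution"  # dummy root node, left implicit in the JSON data

# Hypothetical annotation record. The keys mirror the predicates named in the
# text; the exact JSON layout of the released data is assumed, not quoted.
annotated_unit = {
    "has research problem": ["phrase 1", "phrase 2", "phrase 3"],
    "from sentence": "The contribution sentence the phrases were taken from.",
}


def unit_to_triples(unit, root=ROOT_NODE):
    """Attach every annotated phrase to the root node as a triple."""
    triples = []
    for predicate, objects in unit.items():
        if predicate == "from sentence":
            continue  # provenance only; not part of the KG triples
        for obj in objects:
            triples.append((root, predicate, obj))
    return triples


triples = unit_to_triples(annotated_unit)
for triple in triples:
    print(triple)
```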

Data Statistics
Overall, the NCG Shared Task dataset had 50 articles in the trial data, 237 articles in the training data, and 155 articles in the test data. The trial data articles uniformly spanned five tasks, the training data spanned 24 tasks, and the test data spanned 10 tasks. For the Shared Task itself, participants were encouraged to merge the trial and training datasets. Thus, the overall training data had 287 articles representing 29 unique tasks. The training and test tasks were mutually exclusive except for one, i.e. 'natural language inference.' Table 3 shows further detailed statistics of the NCG dataset in terms of each of the annotated information granularities.
Our full dataset is publicly released online (D'Souza et al., 2021).

Task Description
Our comprehensive NCG Shared Task formalism was as follows. Given a scholarly article $A$ in plaintext format, the goal was to (1) extract a set of contribution sentences $C_{sent} = \{C_{sent_1}, \ldots, C_{sent_N}\}$, (2) extract a set of scientific knowledge terms and predicates from $C_{sent}$, referred to as entities $E = \{e_1, \ldots, e_N\}$, and (3) organize the entities $E$ as a set of (subject, predicate, object) triple statements $T = \{t_1, \ldots, t_N\}$ toward KG building, organized under three or more of the 12 total IUs.
Task Evaluation Phases. The task comprised three evaluation phases, thereby enabling detailed system evaluations.
Evaluation Phase 1: End-to-end Pipeline. In this phase, systems were tested for the comprehensive end-to-end KG building task described in the formalism above. Given a test set of articles $A$ in plaintext format, the participating systems were expected to return: (1) a set of contribution sentences $C_{sent}$, (2) a set of scientific knowledge terms and predicates from $C_{sent}$, i.e. entities $E$, and (3) the entities in $E$ organized in a set of triple statements $T$ toward KG building. System outputs were evaluated for the three aspects and overall.
Evaluation Phase 2, Part 1: Phrases and Triples. In this phase, systems were tested only for their capacity to extract phrases and organize them as triples. Given a test set of articles $A$ in plaintext format and contribution sentences $C_{sent}$ from each article, each system was expected to return: (1) the entities $E$, and (2) the set of triple statements $T$.
Evaluation Phase 2, Part 2: Triples. In this phase, systems were tested only for the triples formation task. Thus, given gold entities $E$ for the set of $C_{sent}$, systems were expected to form triple statements $T$.
In the evaluation phases, which lasted from Jan 10 till Feb 1, 2021, we provided the participants with masked versions of the test set based on the current evaluation phase. The test set annotations in each phase were uploaded to CodaLab and were not available to the participants. To obtain results, the participants were expected to upload their system outputs to CodaLab, where they were automatically evaluated by our script against reference data stored on the platform. In each evaluation phase, teams were restricted to 10 submissions, and only one result, i.e. the top-scoring result, was shown on the leaderboard. Before the task began, our participants were onboarded via our task website https://ncg-task.github.io/.
Further, participants were encouraged to discuss their task-related questions via our task Google groups page at https://groups.google.com/forum/#!forum/ncg-task-semeval-2021.
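The three evaluation phases described above differ only in which gold annotations are handed to a system and which outputs it must produce. A compact way to view this is sketched below; the class and field names are hypothetical, and the assumption that the articles themselves remain available in every phase is ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationPhase:
    name: str
    given: frozenset      # what a system receives as input
    expected: frozenset   # what a system must return


PHASES = (
    EvaluationPhase(
        "Phase 1: End-to-end Pipeline",
        given=frozenset({"articles"}),
        expected=frozenset({"sentences", "entities", "triples"}),
    ),
    EvaluationPhase(
        "Phase 2, Part 1: Phrases and Triples",
        given=frozenset({"articles", "sentences"}),
        expected=frozenset({"entities", "triples"}),
    ),
    EvaluationPhase(
        "Phase 2, Part 2: Triples",
        given=frozenset({"articles", "sentences", "entities"}),
        expected=frozenset({"triples"}),
    ),
)

for phase in PHASES:
    print(f"{phase.name}: given {sorted(phase.given)}, expected {sorted(phase.expected)}")
```

Each later phase moves one more annotation layer from the expected side to the given side, which is what allows the per-stage results to be separated from pipeline error propagation.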
The NCG Data Collection of Articles
Our base collection of scholarly articles was downloaded from the publicly available leaderboard of tasks in AI, https://paperswithcode.com/. While paperswithcode predominantly represents the NLP and Computer Vision research fields in AI, we restricted ourselves to its NLP papers. From their overall collection of articles, the tasks and articles in our final data were randomly selected. The raw articles' PDFs needed to undergo a two-step preprocessing before the annotation task: 1) for PDF-to-text conversion, the GROBID parser (GRO, 2008-2020) was applied; following which, 2) for plaintext preprocessing in terms of tokenization and sentence splitting, the Stanza toolkit (Qi et al., 2020) was used. The resulting pre-processed articles could then be annotated in plaintext format. Note that our data consists of articles in English.

Evaluation Metrics
The NCG Task participating team systems were evaluated for classifying contribution sentences, extracting scientific terms and relations, and extracting triples (see specific details in Section 3). The results from the three evaluation parts were also cumulatively averaged as a single score to rank the teams. Finally, for the evaluations, the standard precision, recall, and F1-score metrics were leveraged.
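For reference, the standard metrics named above can be sketched as follows. This is an illustrative set-based computation with exact matching, not the official scoring program; the function names and the toy items are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 with exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_score(f1_scores):
    """Plain mean of per-part F1 scores, used here as a single ranking score."""
    return sum(f1_scores) / len(f1_scores)


# Toy example: 2 of 3 predictions are correct, 2 of 4 gold items are found.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"b", "c", "d", "e"})
print(round(p, 4), round(r, 4), round(f1, 4))
```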
This completes our discussion of the NCG task in terms of its dataset definition and overall organization description.In the remainder of the paper, we shift our focus to the participating teams.Specifically, we describe the participating systems and examine their results for the NCG task.

Participating System Descriptions
The NCG Shared Task received public entries from 7 participating teams in all.In this section, we briefly describe the teams' systems in terms of the three parts of the NCG task, i.e. contribution sentence classification, scientific terms and relations extraction, and triples extraction.

Contribution Sentence Classification
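To identify the contribution sentences from articles, systems adopted one of two strategies: a binary classification objective, or a multi-class classification objective. In the first strategy, sentences were either classified as contribution sentences or not. In the second strategy, sentences were classified in a 13-class classification task as one of the 12 IUs or as a non-contribution sentence. Next, we describe these strategies. Note that the asterisk superscripts against team names, where present, correspond to the *** 3rd best, ** 2nd best, and * 1st best systems in the Shared Task, respectively.
Binary Classifiers Team YNU-HPCC (Ma et al., 2021) employed BERT as a binary classifier to classify the contribution sentences. Team INNOVATORS (Arora et al., 2021) also employed a BERT-based binary classifier wherein each instance was a set of 10 sentences with additional sentences as context features to the model. Team KnowGraph@IITK*** (Shailabh et al., 2021) used the standard SciBERT + BiLSTM architecture (Beltagy et al., 2019) as a binary sentence classifier. Team UIUC BioNLP* (Liu et al., 2021) employed a BERT-based binary sentence classifier with features that handled sentence characteristics w.r.t. their context in the article, specifically, its closest preceding topmost and innermost section headers and its position in the article.
Multi-class Classifiers Team DULUTH (Martin and Pedersen, 2021) framed a 13-class multi-class classification task. They employed DeBERTa (He et al., 2020) as their classifier. Team ECNUICA (Lin et al., 2021) employed three pretrained transformer models, viz. RoBERTa (Liu et al., 2019), SciBERT (Beltagy et al., 2019), and BERT (Devlin et al., 2019), as an ensemble classifier. They formulated a multi-class classification task as well. The features to the BERT models were the original sentence, contextual information as the previous and next sentence to the original sentence, and a sub-title of the paragraph, with the separator token ([SEP]) in between. Team ITNLP** (Zhang et al., 2021) employed a BERT-based multi-class classifier that leveraged sentence context and the paragraph heading as additional features.
These binary and multi-class sentence classifiers were also adapted to the following dataset characteristics.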

Contribution sentences data imbalance
Characteristically, of all the sentences in the training data articles, only 10% were annotated as contribution sentences. Thus, our dataset presented an imbalanced classification task.
INNOVATORS established a threshold based on cumulative contributing sentence bigram scores as a filter; ITNLP fixed the ratio of positive to negative samples as an integer and tuned the value.
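The fixed-ratio idea mentioned above can be illustrated with a short sketch. This is not Team ITNLP's code: the function name, the `negatives_per_positive` parameter, and the uniform random sampling are assumptions chosen to show how an integer ratio could be tuned.

```python
import random


def downsample_negatives(sentences, labels, negatives_per_positive, seed=0):
    """Keep all positives and at most `negatives_per_positive` negatives each.

    `labels` holds 1 for contribution sentences and 0 otherwise.
    """
    rng = random.Random(seed)
    positive_idx = [i for i, y in enumerate(labels) if y == 1]
    negative_idx = [i for i, y in enumerate(labels) if y == 0]
    budget = min(len(negative_idx), negatives_per_positive * len(positive_idx))
    kept = sorted(positive_idx + rng.sample(negative_idx, budget))
    return [sentences[i] for i in kept], [labels[i] for i in kept]


# Toy data at roughly the 10% positive rate reported for the training set.
toy_sentences = [f"sentence {i}" for i in range(100)]
toy_labels = [1 if i % 10 == 0 else 0 for i in range(100)]
kept_sentences, kept_labels = downsample_negatives(toy_sentences, toy_labels, 2)
print(len(kept_sentences), sum(kept_labels))
```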

Differing tasks coverage between the training and the test datasets
Since only one task was in common between the training and the test datasets, systems trained only on the training data would be applied to articles from nine new tasks as test data. To this end, Team ECNUICA hypothesized that if the classifier could see, i.e. somehow be trained on, the test data tasks, its performance could be boosted. They thus adopted the strategy of retraining their classification ensemble with silver-labeled test data instances. This followed the standard setup of training the classifier on the actual training data, applying it to the test data, and incrementally retraining the classifier leveraging the few confidently classified test instances. The instances were marked as silver training data only when all three ensemble classifiers predicted the same class.
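The agreement criterion for silver labels described above can be sketched in a few lines. The function name and the toy predictions are hypothetical; the retraining loop itself is omitted, and only the selection rule (all ensemble members predict the same class) follows the text.

```python
def select_silver_instances(predictions_by_model):
    """Return {instance index: label} where every classifier agrees.

    `predictions_by_model` is a list of equal-length label lists,
    one list per ensemble member.
    """
    silver = {}
    for index, labels in enumerate(zip(*predictions_by_model)):
        if len(set(labels)) == 1:  # unanimous prediction
            silver[index] = labels[0]
    return silver


# Toy predictions from three classifiers over five unlabeled test sentences.
model_a = ["RESULTS", "NONE", "MODEL", "NONE", "RESULTS"]
model_b = ["RESULTS", "NONE", "APPROACH", "NONE", "BASELINES"]
model_c = ["RESULTS", "MODEL", "MODEL", "NONE", "RESULTS"]
silver = select_silver_instances([model_a, model_b, model_c])
print(silver)
```

Requiring unanimity trades coverage for label quality: fewer instances are added per round, but each added instance is less likely to reinforce an individual model's error.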

Scientific Terms and Relations Extraction
After identifying the contribution sentences, systems then had to extract their scientific terms and relational predicates.
Sequence Labeling Systems The majority, i.e. six of the seven participating systems, adopted a sequence labeling approach.
1. Team YNU-HPCC used a pre-trained BERT model for sequence labeling of each token, obtaining embeddings for each token in the sequence, with softmax and argmax top layers which were shared across all tokens.
2. Team DULUTH trained a feature-based maximum-entropy Markov model (MEMM) to predict scientific terms in the contribution sentences.
3. Team ECNUICA extracted entities using RoBERTa (Liu et al., 2019) with a CRF layer and a BIO sequence labeling scheme. The input sequences to RoBERTa were modified with sub-title information.
4. Team KnowGraph@IITK*** extracted phrases in the sentence by adding BiLSTM layers to the SciBERT + CRF model as a sequence labeler. To mark phrase boundaries, they used the BILUO scheme.
5. Team ITNLP** employed the standard BERT-based model in a sequence labeling setting. They trained ten different models by 10-fold cross-validation and used a voting count threshold scheme to extract the final set of entities.
6. Team UIUC BioNLP* used a BERT-CRF model for phrase extraction and type classification (Souza et al., 2019). They employed the BIO scheme to distinguish the scientific terms vs. predicate phrases.
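Several of the systems above rely on the BIO scheme to mark phrase boundaries. The sketch below shows how a tagged token sequence can be decoded back into typed phrases; the tag names `TERM` and `PRED` and the example sentence are hypothetical, and none of the teams' actual decoders is reproduced here.

```python
def decode_bio(tokens, tags):
    """Turn parallel token and BIO tag lists into (phrase, type) pairs.

    An I- tag that does not continue an open phrase of the same type is
    treated like O, i.e. it is dropped rather than starting a new phrase.
    """
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = "We propose a convolutional matching model".split()
tags = ["O", "B-PRED", "O", "B-TERM", "I-TERM", "I-TERM"]
phrases = decode_bio(tokens, tags)
print(phrases)
```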
Rule-based System Team INNOVATORS leveraged an unsupervised rule-based approach for phrase extraction. Using spaCy (Honnibal et al., 2020), they obtained dependency parses for each sentence. They then implemented a set of dependency tree node traversal heuristics for phrase extraction based on the dependency parses.
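Returning to the sequence labelers, the voting count threshold that Team ITNLP applied over their ten cross-validation models can be illustrated as follows. This is a sketch under assumptions: the function name, the `min_votes` parameter, and the toy predictions are hypothetical, and the threshold value the team actually used is not stated here.

```python
from collections import Counter


def vote_entities(per_model_entities, min_votes):
    """Keep an entity if at least `min_votes` models extracted it."""
    counts = Counter()
    for entities in per_model_entities:
        counts.update(set(entities))  # each model votes once per entity
    return {entity for entity, votes in counts.items() if votes >= min_votes}


# Toy predictions from three models for one sentence.
predictions = [
    {"attention mechanism", "BLEU"},
    {"attention mechanism"},
    {"attention mechanism", "BLEU", "dropout"},
]
final_entities = vote_entities(predictions, min_votes=2)
print(sorted(final_entities))
```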

Triples Extraction
1. Team YNU-HPCC first classified the scientific terms in subject, predicate, and object roles using three binary BERT classifiers. These triples from each contribution sentence were then organized as the 12 IUs leveraging a 12-class contribution sentence classifier. This team, however, did not participate in the end-to-end evaluation task.
2. Team DULUTH applied Stanford CoreNLP's dependency parser (Chen and Manning, 2014) to generate a dependency parse for each contribution sentence. They used the dependency parse structures to assign subject, relation, and object phrase roles to the extracted scientific terms. These were then organized as triples per IU obtained by their 13-class sentence classifier. The overall end-to-end pipeline score achieved by this system was 28.38%.
3. Team INNOVATORS implemented a set of rules based on the dependency parses to form triples from the extracted scientific terms. They used a CNN-based architecture for classifying the contribution sentences as the 12 IUs. Their end-to-end score was 32.05%.
4. Team ECNUICA approached the triples formation task in two steps: i) they formed triple candidates based on the scientific term sequence order in the sentence, additionally employing a set of predefined predicates when the predicates were not directly found in the sentence; and ii) they employed a SciBERT-based binary classifier to classify the triples as true or false candidates. Their overall end-to-end system score was 33.35%.
5. Team KnowGraph@IITK*** addressed the RESEARCHPROBLEM, CODE, BASELINES and ABLATIONANALYSIS IUs by a heuristics-based approach. For the triples of the remaining eight IUs, they followed a 3-step approach: i) they identified predicates from the scientific terms using a binary SciBERT + BiLSTM classifier; ii) they formed triples by arranging the terms and predicates in the exact order in which they appear in the original sentence; and iii) they employed an 8-class SciBERT + BiLSTM classifier to classify the triples. Their overall end-to-end system score was 37.83%.
6. Team ITNLP** extracted triples as follows: i) they formed all possible triple candidates from the classified scientific terms; and ii) they employed a binary BERT classifier to separate true from false candidates. Prior to BERT classification, they obtained the negative candidate triples in two ways: by artificially generating them using random replacement (RR) of one of the arguments of the true triples with a false argument; and by random selection (RS) of triples where no argument is a valid pair of another. Additionally, each of their system components obtained boosted performances with the Friendly Adversarial Training strategy (Zhang et al., 2020). Their overall end-to-end system score was 47.03%.
7. Team UIUC BioNLP* categorized the triples into six types based on our dataset characteristics. Four broad types were: structuring intra-sentence information; linking sentence information to IU; linking IU to the root node; and structuring inter-sentence information. The first two of the four broad types were each further subdivided into two based on whether the predicate was found in the sentence or was the term "has." Each of the six types was addressed by a specifically trained BERT classifier. They obtained an overall end-to-end system score of 38.28% within the task deadline and 49.72% a day later after fixing phrase component offset errors.
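Two recurring ideas in the systems above are forming candidate triples from phrases in sentence order and creating negative training triples by random replacement. The sketch below illustrates both under simplifying assumptions: every order-preserving combination of three phrases is treated as a candidate, and the function names, parameters, and toy phrases are hypothetical rather than taken from any team's code.

```python
import random
from itertools import combinations


def ordered_candidates(phrases):
    """All (subject, predicate, object) candidates that keep sentence order."""
    return list(combinations(phrases, 3))


def random_replacement_negatives(true_triples, vocabulary, per_triple=1, seed=0):
    """Corrupt one argument of each true triple to build negative examples."""
    rng = random.Random(seed)
    gold = set(true_triples)
    negatives = []
    for triple in true_triples:
        for _ in range(per_triple):
            slot = rng.randrange(3)  # which argument to replace
            options = [v for v in vocabulary if v != triple[slot]]
            if not options:
                continue
            corrupted = list(triple)
            corrupted[slot] = rng.choice(options)
            corrupted = tuple(corrupted)
            if corrupted not in gold:  # never emit a true triple as negative
                negatives.append(corrupted)
    return negatives


phrases = ["our model", "achieves", "state-of-the-art results", "on SNLI"]
candidates = ordered_candidates(phrases)
true_triples = [("our model", "achieves", "state-of-the-art results")]
negatives = random_replacement_negatives(true_triples, phrases, per_triple=5)
print(len(candidates), len(negatives))
```

A binary classifier, as used by Teams ECNUICA and ITNLP, would then be trained to separate the true triples from such negatives.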

Shared Task Results
In this section, we present the results of the seven participating teams' systems.
The results in Table 4 show the cumulative scores of the participating teams in each of the three evaluation phases in our Shared Task. We refer the reader to Section 3 for a detailed description of the three evaluation phases. In each phase, teams were officially ranked by these scores. Next, we examine the scores by the individual extraction tasks that constituted building the NLPCONTRIBUTIONGRAPH per article.
[...] task deadline due to an error in their submission offsets for phrases. Thus, they are officially 2nd after the ITNLP team within the Shared Task timeline for Phase 1.

Contribution Sentences Classification
As a first step toward building the NLPCONTRIBUTIONGRAPH, systems were evaluated for identifying contribution sentences. This was done only in Evaluation Phase 1 of the Shared Task, i.e. the phase that tested the end-to-end systems. These results are shown in Table 5 under column "Sentences." The highest score on this subtask was 57% F1. The top two teams, i.e. UIUC BioNLP* and ITNLP**, differed by only 1 point. As a baseline for comparison, a default system would return all titles as candidate contribution sentences. This results in a score of 10.78% F1 at 90% precision and 5.7% recall. In contrast to the 1 sentence per article returned by the default system, our actual data averages 17 sentences per article. Thus the default score was computed on a significantly underestimated data sample, as also reflected by its low recall. Nevertheless, the top systems significantly outperformed this default score, with both systems averaging 20 sentences per article. The lowest score, 38.1% F1 at an average of 12 sentences per article, was also significantly better than the default.
With F1 less than 60%, the task proved challenging. Some teams ascribed this to the dataset characteristic that contribution sentences constituted only a minority of the sentences in the article (<10%) and thus, overall, presented imbalanced data. To address this, they downsampled the data. However, from the two participant systems that used a downsampling strategy, it could not be conclusively verified as an effective strategy since these systems performed on opposite ends of the performance spectrum. On the other hand, incorporating the closest section header and sentence position as features in the BERT model proved an effective and reliable strategy for sentence classification. This modeled the dataset better since the sentences were annotated from a few sections and the sentences were usually close to the section header. The system UIUC BioNLP*, which incorporated such features, outperformed all other systems including the ones with the downsampling strategy, i.e. ITNLP** and INNOVATORS.
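To illustrate what incorporating the section header and sentence position can look like in practice, the sketch below assembles a classifier input from a sentence, its neighbors, and its closest section header. The ordering of the segments, the use of the `[SEP]` separator between them, the normalized position value, and the function name are assumptions for illustration and do not reproduce any team's exact input format.

```python
SEP = "[SEP]"  # separator token used by BERT-style models


def build_classifier_input(sentences, index, section_header):
    """Return (text, relative_position) for the sentence at `index`.

    The text joins the target sentence, its previous and next sentences
    (empty at article boundaries), and the closest section header.
    """
    previous_sentence = sentences[index - 1] if index > 0 else ""
    next_sentence = sentences[index + 1] if index + 1 < len(sentences) else ""
    relative_position = index / max(len(sentences) - 1, 1)
    text = f" {SEP} ".join(
        [sentences[index], previous_sentence, next_sentence, section_header]
    )
    return text, relative_position


article = ["We study NLI.", "We propose a new model.", "It beats the baseline."]
text, position = build_classifier_input(article, 1, "Introduction")
print(text)
print(position)
```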
Finally, how did bootstrapping the test data as silver-labeled data impact model performance? Team ECNUICA, which adopted this strategy, did not obtain a balanced harmonic mean between their precision and recall, achieving the highest recall among all teams of 82.48% and the lowest precision of 26.21%. Thus this strategy did not prove effective or reliable.

Scientific Terms and Relations Extraction
These results are shown in Table 5 under column "Phrases" for the end-to-end systems. The highest F1 obtained on this task was 46.41%. However, this score was impacted by the pipeline setup such that the low performance in sentence classification impacted the performance in this stage. We conducted a separate evaluation phase to control for this aspect. In other words, we examined how the systems would perform only on extracting terms and relations given gold contribution sentences. These results are shown in Figure 3 (a). In fact, the bar chart offers a perspective on the significant differences in system performances when applied on automatically extracted sentences versus gold data. The systems showed the same performance ranking order in both settings. This is a somewhat expected result since none of the systems implemented any specific noisy sentence handling strategy, in which case performance differences might have arisen. In conclusion, the best result was 46.4% F1 in the end-to-end setting and 78.6% F1 when given gold sentences.
Notably, the pipeline systems were 10 points lower for extracting phrases than for sentences.

Triples Extraction
The final extraction task to build the NCG per article was to form triples from the extracted terms and relations. These results for the pipeline systems are shown in Table 5 under column "Triples." The best performance was 22.28% F1 and the 2nd best was significantly lower at 13.79% F1. To evaluate system performances purely for extracting triples, thereby cancelling out the effect of the pipeline setup, additional evaluations were conducted wherein gold data were incrementally made available to the system. These results are shown in Figure 3 (b). Given only the gold sentences, the best team attained 43.44% F1; given gold terms and relations in addition, they achieved 61.29% F1. A score of 61.29% F1 is a strong performance on a still fairly difficult task given the annotation decision subjectivity that may have crept into the data, thereby producing considerable variations in annotation patterns. This is discussed in Section 7.

Identifying only the Information Unit Labels
We conducted a meta-evaluation for identifying the set of IU labels per article. These results are shown in Table 5 under column "Information Units." The top two teams were tied at 72.93% F1, and the next best score was 60.54% F1. Like sentence classification, a default system could be implemented for this task as one that outputs just the three mandatory IUs, i.e. RESEARCHPROBLEM, MODEL, and RESULTS, for all articles. The scores from this default system were 69.01% F1, 81.67% precision, and 59.76% recall. This is about 8.5 points better than the next best score of 60.54% F1. When given gold sentences, systems could be evaluated for identifying just the IUs since the classification was dependent on the underlying sentences. These results are shown in Fig. 4.
A notable exception in the results is that the IU classification score by Team INNOVATORS remained unchanged regardless of pipelined or gold sentences as input.This is because their downsampling heuristic once designed did not rely on the underlying data when filtering.It is likely that the new gold sentences information was not used at all.
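The default system for IU identification described above is simple enough to sketch directly. The gold label sets below are toy data, and micro-averaging over articles is an assumption; the official scoring program may aggregate differently.

```python
MANDATORY_IUS = frozenset({"RESEARCHPROBLEM", "MODEL", "RESULTS"})


def default_iu_scores(gold_ius_per_article):
    """Score a system that predicts the three mandatory IUs for every article."""
    tp = fp = fn = 0
    for gold in gold_ius_per_article:
        gold = set(gold)
        tp += len(MANDATORY_IUS & gold)   # predicted and present
        fp += len(MANDATORY_IUS - gold)   # predicted but absent
        fn += len(gold - MANDATORY_IUS)   # present but never predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy gold annotations for two articles.
toy_gold = [
    {"RESEARCHPROBLEM", "MODEL", "RESULTS", "BASELINES"},
    {"RESEARCHPROBLEM", "APPROACH", "RESULTS"},
]
precision, recall, f1 = default_iu_scores(toy_gold)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The pattern in the toy data mirrors the reported default scores: precision is high because the mandatory units are almost always present, while recall is limited by every additional unit an article carries.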

Discussion
Finally, we conclude our Shared Task paper with a discussion on the perceived limitations of our dataset that can potentially be addressed in future work.Thereby, a new dataset will present new opportunities to evaluate systems on this novel task.
Single Annotator Annotations The NCG Shared Task dataset was annotated by a single annotator. Further, the design of the annotation scheme was supported by only an intra-annotator consensus agreement score for that annotator. Since this work is the first of its kind in proposing an initial scheme, and given the complex nature of this annotation task with the need to design a model within a realistic timeframe, our annotation procedure is well-suited. However, as discussed in our related work (D'Souza and Auer, 2021), in the next stage, we advocate for a blind, multi-stage, and multi-annotator annotation process for the NCG scheme, recognizing it as a potentially better annotation model. Such a process, by incorporating multiple worldviews, could better address annotation inconsistencies that may have crept into our current dataset.
Non-uniform Distribution of Articles As discussed earlier, our combined training dataset had 29 tasks and the test data had 10 tasks. However, these tasks did not have a uniform distribution of articles in our data. In the training data, the number of articles per task ranged from a maximum of 101 in one task, i.e. "natural language inference," to a minimum of one article in each of seven tasks; 58.62% of the training data tasks had fewer than 5 articles. The test dataset, on the other hand, followed a more uniform distribution than the training data, ranging from a maximum of 32 articles to a minimum of seven articles at an average of 15.5 articles per task. While our training dataset had over 200 articles, it may not have been sufficiently representative to learn uniform patterns. Thus in a new version of the dataset, a more uniform representation of the tasks will be attempted.

Conclusions
We have detailed the NLPCONTRIBUTIONGRAPH Shared Task that entailed structuring research contributions in NLP articles as structured KGs. This task is the first of its kind to be organized in the SemEval series. It attracted a strong participation demographic of 27 participants and seven teams. BERT transformer models were a popular choice among the participant systems in two different capacities, i.e. as classifiers or sequence labelers. Our task also saw the use of traditional parsers such as dependency syntax parsing technology. Further, some systems leveraged a hybrid approach including a combination of heuristics and machine learning. While the end-to-end task performance was low, showing the task to be considerably challenging, each individual subtask toward obtaining an NCG, i.e. contribution sentence classification, scientific terms and relations extraction, and triples formation, demonstrated high performances in the subtask-only evaluation setting, i.e. when given gold data from the previous stage. The best system adopted a hybrid approach, which seemed the most effective strategy for building the NCG.
The NCG dataset is publicly available (D'Souza et al., 2021) and a KG overview of a structured form of our paper is available at https://www.orkg.org/orkg/comparison/R74774.

Figure 1: Annotated data in JSON format for the RESEARCHPROBLEM Information Unit for the paper "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation."

Figure 2: Annotated data in JSON format for the MODEL Information Unit for the paper "Convolutional Neural Network Architectures for Matching Natural Language Sentences."
also employed a BERT-based binary classifier wherein each instance was a set of 10 sentences with additional sentences as context features to the model. Team KnowGraph@IITK*** (Shailabh et al., 2021) used the standard SciBERT + BiLSTM architecture (Beltagy et al., 2019) as a binary sentence classifier. Team UIUC BioNLP* (Liu et al., 2021) employed a BERT-based binary sentence classifier with features that handled sentence characteristics w.r.t. their context in the article, specifically, its closest preceding topmost and innermost section headers and its position in the article.
Multi-class Classifiers Team DULUTH (Martin and Pedersen, 2021) framed a 13-class multi-class classification task. They employed DeBERTa (He et al., 2020) as their classifier. Team ECNUICA (Lin et al., 2021) employed three pretrained transformer models, viz. RoBERTa (Liu et al., 2019), SciBERT (Beltagy et al., 2019), and BERT (Devlin et al., 2019), as an ensemble classifier. They formulated a multi-class classification task as well. The features to the BERT models were the original sentence, contextual information as the previous and next sentence to the original sentence, and a sub-title of the paragraph, with the separator token ([SEP]) in between. Team ITNLP** (Zhang et al., 2021) employed a BERT-based multi-class classifier that leveraged sentence context and the paragraph heading as additional features.
Figure 3: (a) Phrases and (b) Triples extraction results

Figure 4: Information Unit identification results in Evaluation Phase 1: End-to-end Pipeline with Pipelined Sentences (blue bars) and Evaluation Phase 2, Part 1 and Part 2 with Gold Sentences (red bars)

Table 1: Annotated RESEARCHPROBLEM Information Unit contribution data as triples. This data is obtained from the JSON data shown in Fig. 1.

Table 3: NLPCONTRIBUTIONGRAPH Shared Task 2021 Overall Corpus Statistics
For the convenience of the participants, the task was divided into four phases. In the Practice phase, which began on Aug 16, 2020, we released the participant kit that included the full training dataset along with the Python code of the official scoring program https://github.com/ncg-task/scoring-program.

Table 5: Evaluation Phase 1: End-to-end Pipeline Results