KnowGraph@IITK at SemEval-2021 Task 11: Building Knowledge Graph for NLP Research

Research in Natural Language Processing is making rapid advances, resulting in the publication of a large number of research papers. Finding relevant research papers and their contributions to the domain is a challenging problem. In this paper, we address this challenge via SemEval 2021 Task 11: NLPContributionGraph, by developing a system that builds a contributions-focused knowledge graph over Natural Language Processing literature. The task is divided into three sub-tasks: extracting contribution sentences that state the important contributions of a research article, extracting phrases from those contribution sentences, and predicting the information units of the research article together with triplet formation from the phrases. The proposed system is agnostic to the subject domain and can be applied to build a knowledge graph for any area. We found that transformer-based language models can significantly improve existing techniques, and therefore use SciBERT-based models. For the first sub-task, we use a Bidirectional LSTM (BiLSTM) stacked on top of SciBERT model layers; for the second sub-task, a Conditional Random Field (CRF) on top of SciBERT with BiLSTM. The third sub-task combines a SciBERT-based neural approach with heuristics for information unit prediction and triplet formation from the phrases. Our system achieved F1 scores of 0.38, 0.63 and 0.76 in end-to-end pipeline testing, phrase extraction testing and triplet extraction testing, respectively.


Introduction
* Authors contributed equally to this work.

Given the advancements in Natural Language Processing (NLP), a large number of research papers are published every year. However, given the field's dynamic nature, keeping track of all the papers is a non-trivial task. This motivated the formulation of an Open Research Knowledge Graph (Jaradeh et al., 2019), a knowledge graph of research contributions and the relations between them. Task 11 of SemEval 2021 (D'Souza et al., 2021) formalizes the building of a contributions-focused knowledge graph of NLP literature. The task is divided into three sub-tasks:

• Sub-task A Extracting sentences that posit contributions in a research paper.
• Sub-task B Extracting relevant phrases, including scientific terms and relational cues, from the sentences extracted in sub-task A.
• Sub-task C Forming triplets (subject phrase, predicate phrase, object phrase) from the phrases extracted in sub-task B and classifying each triplet into one of the information units (IU). There are twelve information units (Research problem, Approach, Results, Model, Code, Dataset, Experimental setup, Hyperparameters, Baselines, Tasks, Experiments, and Ablation analysis), each focusing on a different section of a research paper. Together, these information units can represent all the findings given in a research paper. Of the twelve, the first three information units are present in every article.
Recently, transformer-based approaches have become popular for NLP applications (Liu and Lapata, 2019; Yao et al., 2019). For sub-task A, we propose a sentence-level classifier leveraging SciBERT (Beltagy et al., 2019) and Bidirectional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997). SciBERT has been trained on 1.14M scientific papers from the Semantic Scholar corpus, 18% of which are from the computer science domain. Our system for sub-task B also uses a SciBERT-based model with a CRF (Lafferty et al., 2001) on top of BiLSTM layers, using the BILUO labelling scheme for tokens (B=start token of phrase, I=interior tokens of phrase, L=last token of phrase, U=single-token phrase and O=non-phrase tokens). For sub-task C, we build a combination of neural and heuristic-based approaches. The IU prediction for sub-task C is at the document level: two information units use a heuristic approach, while the others use a multi-label classifier based on BiLSTM stacked on top of SciBERT. For triplet formation in sub-task C, we use a separate SciBERT+BiLSTM classifier along with heuristics. These triplets are classified into one of the already predicted information units, i.e., the output unit having the maximum score among the predicted IU.
However, triplets for some IU, such as Baselines, Ablation analysis, Code and Research problem, are formed well using heuristics, so we use heuristics instead of a neural approach for these IU. Our proposed system ranked third in the overall end-to-end pipeline (A+B+C) testing with an F1 score of 0.38, and fourth in phrase extraction testing (A_GT+B+C) and triplet extraction testing (A_GT+B_GT+C) with F1 scores of 0.63 and 0.76, respectively, where A_GT and B_GT represent the ground truth for sub-tasks A and B. Phrase extraction testing uses ground-truth labels for sub-task A, while triplet extraction testing uses ground truth for both sub-tasks A and B.
We found that the heuristic-based model for two IU (Research problem and Code) in sub-task C achieves high performance, with F1 scores of 0.98 and 1.00, respectively. Our system implementation code is made available via GitHub 1 .

Problem Definition
Consider a document D = {s_1, s_2, ..., s_i, ..., s_N} having N sentences s_i. Sub-task A finds the M contribution sentences, denoted by S = {s_1, s_2, ..., s_M}, from D. Sub-task B selects phrases P = {p_1, p_2, ..., p_i, ..., p_L}, where p_i is a phrase selected from a sentence s ∈ S and L is the total number of phrases in D. Sub-task C forms triplets of extracted phrases for the IU, denoted by U = {u_1, ..., u_i, ..., u_X}, where u_i is one of the twelve IU and X is the number of IU in document D, ranging between three and twelve. For each u_i ∈ U, there is a triplet set T_i = {t_1, ..., t_j, ..., t_O}, where t_j = (subject, predicate, object) is a triplet representing subject, predicate (relation) and object, respectively, and O is the total number of triplets of IU u_i in document D. An example is given in Figure 1 for reference: moving from left to right are the research paper, its contribution sentences, the extracted phrases, and the IU along with their triplets.
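The notation above maps naturally onto simple container types. The sketch below is purely illustrative; the class and field names (Document, InfoUnit, Triplet) are ours, not part of the shared task specification:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    obj: str

@dataclass
class InfoUnit:
    name: str                      # one of the twelve IU names
    triplets: list = field(default_factory=list)

@dataclass
class Document:
    sentences: list                # all N sentences s_1..s_N
    contribution_idx: list         # indices of the M contribution sentences (sub-task A)
    phrases: list                  # phrases p_1..p_L from those sentences (sub-task B)
    units: list                    # 3 to 12 InfoUnit objects with triplets (sub-task C)

doc = Document(
    sentences=["We propose a new model.", "Results improve by 2 points."],
    contribution_idx=[0, 1],
    phrases=["propose", "new model"],
    units=[InfoUnit("Approach", [Triplet("We", "propose", "new model")])],
)
```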

Related Work
Knowledge graphs (Rebele et al., 2016; Hertling and Paulheim, 2018; Lehmann et al., 2015; Carlson et al., 2010) have been shown to be helpful in several areas such as search and knowledge extraction, inter alia. However, only a handful are based on research articles (Xu et al., 2020). Typically, most knowledge graphs are created with rule-based approaches, hence limiting their performance and generalization. However, some recent approaches, such as Sang et al. (2018) and Wang et al. (2020b), use neural methods on biomedical literature. To the best of our knowledge, there is no existing contributions-focused knowledge graph over NLP literature built using a neural approach.
Sub-task A: The sub-task of extracting contribution sentences can also be posed as an extractive summarization problem (Nallapati et al., 2016; Narayan et al., 2018; Liu, 2019; Cheng and Lapata, 2016; Zhou et al., 2018; Dong et al., 2018; Wang et al., 2020a). BERTSUM (Liu and Lapata, 2019) and MATCHSUM (Zhong et al., 2020) are recent methods leveraging language models, evaluated with ROUGE-1, ROUGE-2 and ROUGE-L scores (Lin, 2004) on the DailyMail dataset (Hermann et al., 2015). However, such extractive summarization techniques may not be applicable in our case for several reasons. Firstly, extractive summarization alone will not yield all the contribution sentences, because some contribution sentences may not be relevant to the summarization task. Secondly, extractive summarization models are not tested on long documents such as research articles due to the input-token-length limitation of transformer-based language models. Some long-document transformer methods have been proposed (e.g., Beltagy et al., 2020) and can handle documents up to 4096 tokens; however, in our case documents have on average ∼10,000 tokens. Some extractive summarization methods (Liu and Lapata, 2019; Miller, 2019) take the number of summary sentences as a hyper-parameter, whereas in our case the number of contribution sentences is a trainable quantity learned by the model.
Sub-task B: Sub-task B closely resembles the phrase extraction problem, and several neural methods (Zhu et al., 2020; Zhang et al., 2016, inter alia) and non-neural methods (using n-grams and noun phrases with certain part-of-speech (POS) patterns (Hulth, 2003)) have been proposed. Gollapalli et al. (2017) have shown that a CRF has the potential to improve existing phrase extraction models. Alzaidy et al. (2019) jointly leverage a CRF and BiLSTM to capture hidden semantics for phrase extraction. Zhu et al. (2020) extended the work of Alzaidy et al. (2019) with the idea of self-training and used word embeddings, POS embeddings, and dependency embeddings with a BILUO labelling scheme in the output. Our proposed model takes inspiration from Zhu et al. (2020) and uses a SciBERT-based model with a CRF on top of BiLSTM layers, with the BILUO labelling scheme on tokens. Our model captures better semantics than the word-embedding-based approach of Zhu et al. (2020) because of SciBERT, which is trained on a scientific corpus. Moreover, our model uses the WordPiece tokenizer and is hence robust to out-of-vocabulary (OOV) tokens. Sahrawat et al. (2020) used contextual embeddings in a BiLSTM and CRF model with the BIO (B=start token of phrase, I=continuation tokens of phrase and O=non-phrase tokens) labelling scheme. Ratinov and Roth (2009) showed that the BILUO scheme is superior to the BIO scheme; hence we adopt the BILUO scheme for sub-task B. Recently, Lai et al. (2020) combined sequence labelling with joint learning inspired by self-distillation to boost model performance on unsupervised datasets. However, their model used the BIO labelling scheme and gave only comparable or marginal improvements in a supervised setting.

Sub-task C: Sub-task C can be divided into two parts: information unit (IU) prediction and triplet formation. Both parts have mainly been approached in the literature using rule-based methods.
Rusu et al. (2007) suggest using syntactic parsers for generating parse trees, followed by triplet extraction using parser-dependent techniques. Jivani et al. (2011) proposed an algorithm that exhibits the relationship between subject and object in a sentence using the Stanford parser. This rule-based algorithm can form multiple triplets from a sentence, unlike Rusu et al. (2007). Stanford OpenIE relation triplet formation (Angeli et al., 2015) uses a classifier which learns to extract self-contained clauses from longer sentences, then forms the final triplets using heuristics. Hamoudi (2016) and Jaiswal and George (2015) are also rule-based methods for triplet formation, using the Stanford dependency parser and constituency parser, respectively. KG-BERT (Yao et al., 2019) uses the BERT language model and utilizes the entity and relation descriptions of a triplet to compute its scoring function.

System Overview
The proposed system is shown in Figure 2, depicting the entire pipeline and the model for each sub-task.

Sub-Task A

Each document D_i is a sequence of sentences {s_i1, s_i2, ..., s_iN}, where N is the number of sentences in the document and s_ij is the j-th sentence of document D_i. Each sentence is assigned a ground-truth label, where label "1" represents a contribution sentence and label "0" a non-contribution sentence. The sentences are processed using a SciBERT model followed by stacked BiLSTM layers, whose output is further processed through linear layers with ReLU (Nair and Hinton, 2010) non-linearity. We add dropout layers to avoid overfitting. The last linear layer consists of two units corresponding to label "0" and label "1". The final output label is the one whose corresponding unit has the higher score in the last linear layer. Our loss function is a weighted binary cross-entropy loss, where the weights are set according to the number of samples in each class.
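The paper states only that the loss weights follow the per-class sample counts; a common choice is inverse-frequency weighting, which the sketch below assumes:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights for a binary label list, so the scarce
    contribution class ("1") counts more in the cross-entropy loss.
    The exact formula here is our assumption, not stated in the paper."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}

# With the roughly 1:10 class skew reported for sub-task A:
labels = [1] * 10 + [0] * 100
w = class_weights(labels)
# the minority class receives ~10x the weight of the majority class
```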

Sub-Task B
Initial Experimentation: We built our initial method, a BiLSTM+CRF model, along the lines of Zhu et al. (2020). The model uses a BiLSTM layer along with a CRF for sequence labelling (BILUO) in order to mark the phrases in a sentence. We then introduced SciBERT (replacing the BiLSTM layers) to fine-tune and generalise better, with an increase in performance. We tested the significance of the CRF by replacing it with a softmax layer, which gave poor performance since a softmax is unable to learn the constraints of the BILUO scheme.

Proposed Approach: We further improved the SciBERT+CRF model to capture more semantic information. Our proposed model stacks BiLSTM layers on top of SciBERT, followed by a CRF (see Figure 2). The word-level representation {x_1, x_2, ..., x_N} of the input sentence passes through the SciBERT tokenizer. We use the representation of the first sub-token of every word as the input to SciBERT. The tokenized input is passed into the SciBERT layer, followed by the BiLSTM layer. The final feature output is mapped to a hidden linear layer to get the score matrix Z, which is passed into the CRF layer for label prediction y. The CRF layer is the same as described in Lample et al. (2016). The output produced by SciBERT + BiLSTM + hidden layer is a scoring matrix Z (n×l), where n denotes the number of words in the input sentence and l is the number of labels (l=5). The score of an output sequence y using the CRF is given by:

Scr(s, y) = Σ_{i=1}^{n} Z_{i,y_i} + Σ_{i=2}^{n} T_{y_{i−1},y_i}    (1)

where Z_{i,j} denotes the score of word w_i with the j-th label, T_{y_{i−1},y_i} is the transition score from label y_{i−1} to label y_i, y = {y_1, y_2, ..., y_n} is the sequence of true labels, and Scr(s, y) corresponds to the output score for sentence s and true labels y. A softmax over all possible label sequences yields a probability for the true labels y:

P(y|s) = exp(Scr(s, y)) / Σ_{y′∈Y(s)} exp(Scr(s, y′))    (2)

where Y(s) corresponds to all the possible label sequences for sentence s.
During training, our task is to maximize the log-probability of the correct label sequence y. The model loss is defined as:

Loss(Θ) = −Σ_{i=1}^{M} log P(y_i|s_i) + λ‖Θ‖²    (3)

where s_i is the input sentence, y_i is the corresponding true label sequence, Θ denotes the model parameters, λ is the regularization hyperparameter, and M is the train set size. The output label prediction is made by:

y* = argmax_{y′∈Y(s)} P(y′|s)    (4)

where y* represents the final output label sequence, s is the input sentence, Y(s) is the set of possible label sequences, and P(y′|s) denotes the probability of label sequence y′ for sentence s. The phrases are extracted from the predicted label sequences using the BILUO scheme.
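The CRF score and normalization above can be sketched in plain Python. The brute-force partition sum is only for illustration (real CRFs use the forward algorithm), and the toy emission/transition scores are invented:

```python
import itertools
import math

def crf_score(Z, T, y):
    """Scr(s, y) = sum_i Z[i][y_i] + sum_i T[y_{i-1}][y_i].
    Z: n x l emission scores, T: l x l transition scores, y: label sequence."""
    emit = sum(Z[i][y[i]] for i in range(len(y)))
    trans = sum(T[y[i - 1]][y[i]] for i in range(1, len(y)))
    return emit + trans

def crf_prob(Z, T, y, n_labels):
    """P(y|s) by enumerating all label sequences; fine only for toy sizes."""
    log_partition = math.log(sum(
        math.exp(crf_score(Z, T, list(yp)))
        for yp in itertools.product(range(n_labels), repeat=len(Z))))
    return math.exp(crf_score(Z, T, y) - log_partition)

# Toy example: 2 words, 2 labels.
Z = [[1.0, 0.0], [0.0, 1.0]]    # emission scores
T = [[0.5, -0.5], [-0.5, 0.5]]  # transition scores
```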

Sub-Task C
Initial Experimentation: We used a combination of neural and rule-based approaches for sub-task C. An IU triplet has three phrases: subject, predicate and object. Research problem and Code IU triplets have a fixed subject and predicate. We employed a heuristic that scans the phrases of the first thirty lines of each document and selects phrases only from sentences with exactly one extracted phrase; these phrases form the objects of Research problem IU triplets. A regex expression extracts all the sentences in the article that contain URLs as candidates for Code IU triplet objects; however, only those URL sentences are selected which contain a token such as "our" or "code". Our initial approach was to form the triplets for all other IU and classify them into one of the ten remaining information units using a SciBERT+BiLSTM multi-class classifier (BiLSTM layers stacked on top of SciBERT). The heuristic for triplet formation is based on the orthographic layout of the document: first, the phrases are arranged in the exact order in which they appear in the original sentence; then, every three consecutive phrases within the same sentence are taken as one triplet. This approach gave us decent results, since most research papers are written in the active voice: the subject phrase occurs first, followed by its corresponding predicate, and last the object phrase. The SciBERT+BiLSTM multi-class classifier takes concatenated triplets as the input and their corresponding information unit as the ground truth. The loss is a cross-entropy loss, with the final softmax layer having ten classes corresponding to the ten information units. One drawback of this model is that IU prediction depends on the correctness of triplet formation. Further, a single triplet does not carry enough context to be classified into the correct information unit.
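The Code IU heuristic above can be sketched as follows; the exact regex patterns and cue words are our guess at the paper's rule, not a published specification:

```python
import re

URL_RE = re.compile(r"https?://\S+")
CUE_RE = re.compile(r"\b(our|code)\b", re.IGNORECASE)

def code_iu_objects(sentences):
    """Keep a URL only when its sentence also carries an ownership cue
    such as 'our' or 'code', as described in the heuristic above."""
    objects = []
    for s in sentences:
        urls = URL_RE.findall(s)
        if urls and CUE_RE.search(s):
            objects.extend(urls)
    return objects

sents = [
    "Our code is available at https://github.com/example/repo .",
    "See the shared task page at https://example.org/task11 .",
]
# only the first sentence carries an ownership cue
```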
We visualized triplet formation for feature extraction and found that triplets for some IU, such as Baselines and Ablation analysis, can be better formed using heuristics. Our proposed approach yields a better triplet formation model and eliminates these limitations.

Proposed Approach: In our proposed approach, prediction and triplet formation for some information units, namely Research problem and Code, are entirely heuristic-based and the same as in the initial experimentation. Our proposed method for the remaining ten IU is divided into two parts.

IU Prediction - We propose a SciBERT+BiLSTM multi-label classifier (BiLSTM layers stacked on top of SciBERT) (refer to Figure 2) whose input is the concatenated phrases and which predicts the IU of the document. The concatenation follows the order of occurrence of the phrases in the document. Moreover, these concatenated extracted phrases represent the whole document, since they include all relevant keywords of the research article necessary for information unit prediction. Hence, our proposed model encodes information at the document level, which makes it superior to the initial experimentation method.
Triplet Formation and Classification - In general, a phrase is either a relational phrase or a scientific term (Figure 1). A specific trend observed in the ground-truth triplets is that a predicate is unique to its triplet, in contrast to subject and object phrases, which can appear in multiple triplets. Hence, a one-to-one relation exists between predicate phrases and triplets. We identify all predicate phrases among the extracted phrases of sub-task B; then, a corresponding triplet is formed for each predicate phrase. We train a SciBERT+BiLSTM binary classifier (BiLSTM layers stacked on top of the SciBERT model) to identify the phrases which act as predicates (relational phrases). The model is fine-tuned on our dataset and uses the weighted binary cross-entropy loss. In the labelling, "1" denotes a predicate and "0" a non-predicate. To form triplets, we use a simple heuristic: arrange the phrases in the exact order in which they appear in the original sentence; then, for every phrase predicted as a predicate, take its previous phrase as the subject and its next phrase as the object. Then, a multi-class classifier for triplet classification over 8 IU (all IU except Research problem, Code, Baselines and Ablation analysis) (refer to Figure 2) is built, similar to the one described in the initial experimentation. Since we have already predicted the information units, during inference a triplet can be assigned only to one of the already predicted information units, i.e., the output unit having the maximum score among the predicted IU.
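The predicate-centred triplet formation rule above can be sketched directly; handling of predicates at the start or end of the phrase list (no neighbour on one side) is our assumption:

```python
def form_triplets(phrases, is_predicate):
    """For each phrase flagged as a predicate, take the previous phrase
    as subject and the next phrase as object. phrases must already be in
    document order; edge predicates without both neighbours are skipped."""
    triplets = []
    for i, pred in enumerate(phrases):
        if is_predicate[i] and 0 < i < len(phrases) - 1:
            triplets.append((phrases[i - 1], pred, phrases[i + 1]))
    return triplets

phrases = ["our model", "outperforms", "the baseline"]
flags = [False, True, False]
# → [("our model", "outperforms", "the baseline")]
```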
We used rule-based heuristics for triplet formation for the Baselines and Ablation analysis IU. The target sentences, whose phrases belong to the Baselines IU, are identified by selecting all the headings (i.e., lines having no punctuation) containing words such as "baseline" or "comp" using a regex expression. Then, we take all the sentences between a selected heading and the next heading as the target sentences. The phrases associated with the extracted sentences are used for triplet formation via the same rule of three consecutive phrases within a sentence. The same method is followed for Ablation analysis IU triplets, with the only change being that the relevant headings are found using a regex expression that checks whether the heading contains a word such as "ablation" or "analysis".
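The heading-based selection above can be sketched as follows; the punctuation test and keyword matching are our assumptions about the paper's regex rule:

```python
import re

def section_sentences(lines, keywords=("baseline", "comp")):
    """A heading is taken to be a line with no sentence punctuation;
    collect the sentences between a keyword-matching heading and the
    next heading, as in the heuristic described above."""
    def is_heading(line):
        return not re.search(r"[.,;:!?]", line)
    selected, keep = [], False
    for line in lines:
        if is_heading(line):
            keep = any(k in line.lower() for k in keywords)
        elif keep:
            selected.append(line)
    return selected

lines = [
    "Baselines",
    "We compare against BERT.",
    "Results",
    "Our model wins.",
]
# only the sentence under the "Baselines" heading is selected
```

For Ablation analysis, the same function would be called with `keywords=("ablation", "analysis")`.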

Data
The dataset annotation scheme is as per D'Souza and Auer (2020). The pre-processed dataset consists of 287 annotated NLP research documents in the English language with ground truth for each sub-task. The train, dev and test sets have 237, 50 and 155 documents, respectively. We have chosen 100 as the maximum token length of a sentence with the WordPiece tokenizer, since 99.7% of sentences in the train set have at most 100 tokens. Table 1 shows token lengths and the percentage (%) of sentences up to each length. In sub-task C, if there is no suitable predicate among the extracted phrases, then the triplet's predicate is chosen from the predefined set of predicates, i.e., "has", "on", "by", "for", "has value", "has description", "based on", "called". Table 2 shows the dataset statistics for sub-tasks A and B. The dataset statistics for sub-task C are given in Table 3. The Tasks information unit has no triplets in the train set, while the Code and Dataset information units have very few triplets in the dev set.
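The coverage statistic used above to choose the 100-token cutoff is simply the fraction of sentences at or below a given length; the numbers in this sketch are invented:

```python
def coverage(token_lengths, max_len):
    """Fraction of sentences with at most max_len tokens; the same
    statistic used to pick 100 as the maximum sentence length."""
    return sum(1 for n in token_lengths if n <= max_len) / len(token_lengths)

lengths = [12, 40, 97, 103, 55]  # hypothetical per-sentence token counts
# coverage(lengths, 100) → 0.8, i.e. 4 of 5 sentences fit within the cutoff
```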

Evaluation Metrics
In this task, the organizers used Precision, Recall and F1 score metrics. In sub-task A, the metrics are computed from the predicted and ground-truth contribution sentences of the document. The sub-task B output contains the predicted phrase, the contribution sentence number, and the starting and ending character offsets of the predicted phrase (Figure 1); these outputs are the basis for the sub-task B metric calculations. Sub-task C has two groups of metrics. The first evaluates the Information Units prediction of a document (regardless of triplets). The second group evaluates both Information Units and triplets (a predicted triplet is correct only if it exactly matches the ground-truth triplet and the ground-truth IU; otherwise, it is incorrect). The final score is the average of all four F1 scores on the dataset, and the participating teams are ranked according to this score. Table 4 shows the average F1 scores of all participating teams. In end-to-end pipeline testing, the input is the documents alone. In phrase extraction testing (A_GT+B+C), the input is the document and the ground truth for sub-task A; here, A_GT represents the true labels of sub-task A, and the F1 score for sub-task A is 1.00. In the triplet extraction phase (A_GT+B_GT+C), the input to the system is the document and the ground-truth labels (F1 score = 1.00) for both sub-tasks A and B. Our team "KnowGraph@IITK" achieved an F1 score of 0.3783 and ranked third in end-to-end pipeline testing.

Table 4: Average F1 scores on the test set of participating teams in end-to-end pipeline testing (A+B+C), phrase extraction testing (A_GT+B+C) and triplet extraction testing (A_GT+B_GT+C).

Results
In the phrase extraction and triplet extraction testing phases, our system ranked fourth with F1 scores of 0.6318 and 0.76, respectively. Our heuristic-based triplet extraction for the Code and Research problem information units achieved excellent performance, with F1 scores of 1.00 and 0.9756, respectively, on the test set. We built several models using SciBERT, LSTM and CNN for each sub-task to understand each method's significance (Table 5). We found that language models are knowledge-rich and boost the existing models. On top of language models, the BiLSTM-based model performs better than the CNN-based model due to the long semantic dependencies captured by sequential models. In sub-task B, the BiLSTM+CRF model performed worse than the same model built on top of SciBERT. In triplet formation for sub-task C, our rule-based approach for the Research problem and Code information units yields excellent results (highest on the leaderboard). A significant improvement in the F1 score for the Baselines and Ablation analysis IU suggests that a rule-based approach can boost neural models, since specific patterns are present in these information units' triplets. In sub-task A, the dataset is highly skewed between minority and majority classes (1:10), making the training of a neural model difficult. On visualizing the sub-task B outputs, we found some ambiguous phrases that our model fails to predict correctly. Extracting both scientific and relational cue phrases with high precision and recall in a single model is difficult; sometimes our model predicts the scientific phrases correctly but fails to predict the relational cue phrase in the same sentence.

Ablation and Error Analysis
In sub-task C, triplets for some IU, such as Model, Hyperparameters and Results, are comparatively abundant, while Tasks and Dataset IU triplets are scarce. This skew biased our multi-label classification model for IU prediction. In triplet formation, our predicate-based approach fails when the predicate is not present in the sentence and must instead be selected from the organizers' closed set of predicates. The proposed system also fails in the case of branching between triplets, i.e., multiple triplets sharing the same subject phrase. Our triplet formation is unable to predict Approach, Dataset and Tasks triplets. On visualizing the outputs, we found that Model triplets are predicted instead of Approach triplets. Further, the triplets lack enough features to be classified into an information unit using a neural method. Our system's performance on sub-task C is poorer in the end-to-end pipeline and phrase extraction testing because we submitted our initial experimentation model for those phases.

Conclusion
In this paper, we have presented our system for the NLPContributionGraph task of SemEval 2021. We found that neural models combined with heuristics can build a knowledge graph by dividing the problem into smaller tasks. A heuristic-based model can outperform the neural approach in the triplet formation of some information units. In future work, neural models could use sparse transformers to encode long documents without a large increase in memory. Incorporating the predefined set of predicates could also be a future direction for our system.