CLaC-np at SemEval-2021 Task 8: Dependency DGCNN

MeasEval aims at identifying quantities, along with the entities that are measured and their additional properties, within English scientific documents. The variety of styles in which measurements are expressed makes this crucial aspect of scientific writing challenging to extract. This paper presents ablation studies making the case for several preprocessing steps, such as specialized tokenization rules. To exploit linguistic structure, we encode dependency trees in a Deep Graph Convolutional Neural Network (DGCNN) for multi-task classification.


Introduction
Scientific articles contain many quantities that have to be linked to their measured entities. Identifying quantities may seem as simple as digit recognition, but numbers alone are not informative. The entities and properties being measured, while crucial information, are difficult to extract. SemEval 2021 Task 8 (Harper et al., 2021) is a semantic relation extraction task consisting of 5 subtasks: identifying quantities and their modifying attributes, and identifying measured entities and their properties, as well as qualifying attributes, if specified.
Recent reports on the strong performance of purely neural models for NLP tasks often underreport the data preprocessing and postprocessing steps that accompany them, even though preprocessing significantly influences overall performance. Typical NLP preprocessing steps include sentence splitting and tokenization, sometimes followed by task-relevant gazetteer annotation, and possibly named entity recognition (NER), part-of-speech (POS) tagging, and dependency parsing. These preprocessing steps are so common that many different packages perform them, such as Stanford CoreNLP (Manning et al., 2014), spaCy (https://spacy.io/models), and NLTK (Bird et al., 2009). Linguistically inspired features are, however, not regularly exploited, and we present here one attempt at encoding dependency information for the structural task of linking quantities with their measured entities, measured properties, or qualifiers.
We approach the MeasEval task as a multi-class classification task using a Deep Graph Convolutional Neural Network (DGCNN), treating the dependency parse tree as a graph to convolve over. We explore tokenization variants, as well as encodings of the dependency relations using node2vec (Grover and Leskovec, 2016) and UMAP (McInnes et al., 2018) techniques.

Problem Statement
The MeasEval (Harper et al., 2021) task consists of 5 (not independent) subtasks covering span detection, classification, and relation extraction across multiple sentences. Given a paragraph of scientific content in English, a system should: 1) label quantity spans (Q), where a Q can be a simple count or a numerical value with a unit; 2) if there is a unit, label it as Unit (U), and classify each Q into one of the Modifier (mod) types (count, approximate, range, list, mean, median, medianHasSD, meanHasTolerance, rangeHasTolerance, hasTolerance); 3) for each Q, identify the span of a measured entity (ME), if one exists, and also any measured properties (MP); 4) identify any spans of qualifiers (QL) that record additional detail related to a Q, ME, or MP; 5) label the relationships between Q, ME, MP, and QL spans using the HasQuantity (HQ), HasProperty (HP), and Qualifies relation types.

System Overview
Motivation: Unlike named entity detection tasks, MeasEval's ME and MP detection depends on quantities and their relation to other tokens within a sentence. Since dependency parse trees are capable of providing approximations of the semantic relationships between predicates and their arguments, we opted to generalize over different dependency parse trees to obtain latent path representations that distinguish between the different semantic connections quantities have with MEs or MPs. To encode this path representation, we use a Graph Convolutional Network (GCN) (Kipf and Welling, 2017), in the form of a DGCNN, operating directly on the dependency graph to capture higher-order neighborhood information in the form of embeddings. This embedding is used to classify the relationship type between tokens into one of the six classes explained in Section 3.2.2. Our system has three main phases: preprocessing, input creation and GCN training, and postprocessing. Each phase communicates with the next through files in CoNLL format (Tjong Kim Sang and De Meulder, 2003).

Phase-I: Preprocessing
We preprocess data using the GATE (Cunningham et al., 2013) modules ANNIE Tokenizer, ANNIE Sentence Splitter, and the Stanford Parser (POS tags and dependency graphs (de Marneffe et al., 2006)). The special tokenization rules we added are:

- mixed character protection: prevents tokens of differing character types from being split apart, e.g. δ13CTOC is kept whole rather than split into δ, 13, CTOC;
- split mathematical symbols: preserves the usual ANNIE tokenization of 5 ≤ 2θ/° ≤ 80 into seven tokens (5, ≤, 2θ, /, °, ≤, 80);
- number normalization: decimal numbers are prevented from being split into three tokens, and number words are also identified as numbers;
- abbreviation period: common abbreviations in scientific journals are recognized as integral tokens including the abbreviation period, e.g. e.g., Fig., sp., spp. This improves sentence splitting and tokenization;
- list and interval protection: scientific articles frequently report on intervals expressed in different ways and on lists of variable length. Neither usually receives proper parse assignments, because the group as a whole plays a role in the text. To improve the dependency relation assignments, we manually assign the POS tag 'CD' to the groupings CD (: | - | to) CD and CD (, CD)* and CD;
- unit gazetteer: composed from different sources, listing 4280 units.
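The list and interval protection rule above can be approximated over raw text with regular expressions. The sketch below is illustrative only and assumes a simplified numeric pattern in place of real CD-tagged tokens; the actual system operates on GATE annotations, not strings.

```python
import re

# NUM is a simplified stand-in for tokens POS-tagged as CD.
NUM = r"\d+(?:\.\d+)?"

# CD (: | - | to) CD    e.g. "5 to 80"
INTERVAL = re.compile(rf"{NUM}\s*(?::|-|to)\s*{NUM}")
# CD (, CD)* and CD     e.g. "4.5, 6 and 9"
LIST = re.compile(rf"{NUM}(?:\s*,\s*{NUM})*\s+and\s+{NUM}")

def protect_groupings(text):
    """Return character spans that would receive a single 'CD' POS tag."""
    spans = [m.span() for m in INTERVAL.finditer(text)]
    spans += [m.span() for m in LIST.finditer(text)]
    return spans
```

Running `protect_groupings("values of 5 to 80 and samples 4.5, 6 and 9 kg")` flags "5 to 80" as an interval and "4.5, 6 and 9" as a list.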

Phase-II: Input creation and GCN training
We train a DGCNN as a multilayer neural network that operates directly on a graph to induce node embeddings with properties of their neighborhood. DGCNN takes (A, I) as input, where A ∈ R^{n×n} is an adjacency matrix and n is the number of nodes in the graph G. I ∈ R^{n×c} is an information matrix, associating c feature values with each of the n nodes. A single layer of DGCNN captures information according to

Z = f(D̃^{-1} A I W),

where D̃ is a diagonal degree matrix with D̃_ii = Σ_j A_ij (capturing the branching factor of node i), W ∈ R^{c×c} is a trainable parameter matrix, f is a nonlinear activation function, and Z ∈ R^{n×c}. Higher-order neighborhood information is obtained by stacking multiple DGCNN layers:

Z^{t+1} = f(D̃^{-1} A Z^t W^t),

where Z^0 = I, Z^t ∈ R^{n×c_t} is the output of the t-th graph convolution layer, c_t is the size of the output vector of layer t, and W^t ∈ R^{c_t×c_{t+1}}.
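The layer equation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual implementation; the choice of tanh as the nonlinearity f is an assumption here.

```python
import numpy as np

def dgcnn_layer(A, Z, W):
    """One graph-convolution layer: Z' = f(D^{-1} A Z W).
    D is the diagonal degree matrix with D_ii = sum_j A_ij."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return np.tanh(D_inv @ A @ Z @ W)

# Toy graph: 3 nodes with self-loops, 2 input features per node.
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
I = np.random.randn(3, 2)
W0, W1 = np.random.randn(2, 4), np.random.randn(4, 4)

Z1 = dgcnn_layer(A, I, W0)   # first layer: 1-hop neighborhood
Z2 = dgcnn_layer(A, Z1, W1)  # stacked layer: aggregates 2-hop information
```

Stacking layers in this way is what lets the classifier see higher-order neighborhood structure around the two center tokens.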
We model a dependency tree as a graph G = (V, E), where V are the tokens and E the directed dependency relations. We ensure (v, v) ∈ E for all v ∈ V. To represent paths in the dependency graph between any two nodes, we add explicit reverse links (for an nsubj edge from governor to dependant, we add an rnsubj edge from dependant to governor).
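The graph construction just described can be sketched as follows. The function name and the "self" label for self-loops are illustrative choices, not part of the system.

```python
def build_dependency_graph(tokens, deps):
    """Build a labelled edge map from a dependency parse.
    `deps` is a list of (governor_index, relation, dependent_index) triples.
    Each relation gets an explicit reverse edge labelled "r" + relation,
    and every token node gets a self-loop."""
    edges = {}
    for gov, rel, dep in deps:
        edges[(gov, dep)] = rel
        edges[(dep, gov)] = "r" + rel   # reverse link, e.g. rnsubj
    for v in range(len(tokens)):
        edges[(v, v)] = "self"          # ensures (v, v) is in E
    return edges

g = build_dependency_graph(["cats", "sleep"], [(1, "nsubj", 0)])
```

With the reverse links in place, every dependency path between two tokens is a directed path, which is what the path-embedding features below rely on.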

Input creation
The DGCNN classifier predicts six output classes, as defined in Section 3.2.2, for token pairs (t1, t2), the subgraph center points.
The following sections show how (i) candidate token pairs are created, (ii) the smallest subgraph containing t1 and t2 is extracted, and (iii) each subgraph SG is represented by (A_SG, I_SG).

(Figure 1: System architecture from Phase-II to Phase-III)

Subgraph center point candidates: In all pairs (t1, t2), t1 has to be a CD (a candidate for a quantity). In principle, all nodes in the graph are candidates for t2, but we empirically set a limit of five on the connecting path length.
Subgraph extraction: For each pair (t1, t2), we select the subgraph containing the shortest path recursively: SG_1 contains all neighbors of t1; SG_{k+1} contains SG_k and all neighbors of SG_k. We select the first subgraph that contains t2.
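The recursive expansion can be sketched as an iterative neighborhood closure; the `max_hops` cutoff mirrors the empirical path length limit of five mentioned above. This is an illustrative sketch, with `neighbors` assumed to map each node to the set of its adjacent nodes.

```python
def extract_subgraph(neighbors, t1, t2, max_hops=5):
    """Grow SG_1 (t1 and its neighbors), then SG_{k+1} = SG_k plus all
    neighbors of SG_k; return the first node set containing t2, or None
    if t2 is not reached within the hop limit."""
    sg = {t1} | neighbors[t1]                       # SG_1
    for _ in range(max_hops):
        if t2 in sg:
            return sg
        sg = sg | {u for v in sg for u in neighbors[v]}
    return None

# Chain graph 0 - 1 - 2 - 3:
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
```

On the chain graph, `extract_subgraph(chain, 0, 2)` stops at {0, 1, 2}, the smallest expansion containing both center points.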
Adjacency matrix representation: The adjacency matrix A ∈ R^{n×n} is binary. Dependency relations from governor to dependent are one-to-many; the rows of the matrix are interpreted as dependants and the columns as governors.
Information matrix: The information matrix I is an n × c matrix, where c is the size of the concatenated values of the five explicit or latent features associated with each vertex in our system:

DISTANCE FEATURE: We use Double Radius Node Labelling (DRNL) to encode the combined distance of a node v to both subgraph center nodes as a one-hot vector in the information matrix. Nodes t1 and t2 carry the label 1. For every other node v in the subgraph, we calculate a label representing distance using the hashing function

f(v) = 1 + min(d_t1, d_t2) + (d/2)[(d/2) + (d%2) − 1],

where f(v) assigns a label to node v, d_t1 and d_t2 are the distances of v to t1 and t2 respectively, d = d_t1 + d_t2, d/2 is the integer quotient, and d%2 is the remainder of d divided by 2. (Note that the gold relations connect text spans, not necessarily tokens. The classifier attempts to predict relations between tokens, and the postprocessing phase maps the results to spans.)
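The DRNL hashing function is a one-liner in code; a quick sketch makes its behavior concrete (nodes closer to both center points get smaller labels):

```python
def drnl_label(d1, d2):
    """Double Radius Node Labelling hash:
    f(v) = 1 + min(d1, d2) + (d/2)[(d/2) + (d%2) - 1],
    where d1, d2 are v's distances to the two center nodes
    and d = d1 + d2. The center nodes themselves are labelled 1."""
    d = d1 + d2
    return 1 + min(d1, d2) + (d // 2) * ((d // 2) + (d % 2) - 1)
```

For example, a node adjacent to both center points (d1 = d2 = 1) gets label 2, and labels grow as the node moves away from either center, so the one-hot encoding separates near and far nodes.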
POS FEATURE: encoded as a one-hot vector.

WORD EMBEDDING: from the PubMed ELMo model (Peters et al., 2018), of size 1024.
DEPENDENCY PATH EMBEDDING: represents the dependency path p(v, t1) of each node v within the subgraph from the base node t1. We create dependency embeddings of size 128 from dependency sequences in the training data using node2vec (Grover and Leskovec, 2016; https://github.com/aditya-grover/node2vec). Given a graph G, node2vec uses a random walk procedure from each node v ∈ G to produce s sequences of length l and trains on these sequences. We embed dependency sequences instead of node sequences to produce an embedding for each dependency relationship type (i.e. we use node2vec to produce embeddings for edges instead of nodes). To represent p(v, t1), we concatenate the dependency embeddings of each dependency along the path. Given our empirical path length limit, p(v, t1) ∈ R^{5×128}; for smaller subgraphs we pad with 0's.

UMAP EMBEDDING: As either a complement or a replacement to the dependency embeddings, we experimented with the UMAP dimension reduction technique (McInnes et al., 2018). We trained UMAP in a supervised fashion, feeding dependency embeddings along with class labels, and reduced the dependency path embeddings to 2 dimensions.

Training the DGCNN model
The input (A, I) was used to train an off-the-shelf implementation of DGCNN (https://github.com/muhanzhang/pytorch_DGCNN) with CrossEntropyLoss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) as its loss function, with class weights calculated as 1 / (number of data points in the class). We trained the system for 6 epochs with a batch size of 100 in a CUDA environment, predicting 6 output classes.
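The class weighting above counters class imbalance (most candidate pairs fall into class 0). A minimal sketch of the computation, which would then be passed as the `weight` tensor to `CrossEntropyLoss`:

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by 1 / (number of data points in that class),
    so rare classes contribute more to the loss."""
    counts = Counter(labels)
    return {c: 1.0 / n for c, n in counts.items()}

# Toy label distribution: class 0 is three times as frequent as class 2.
w = class_weights([0, 0, 0, 1, 1, 2])
```

Here class 0 gets weight 1/3, class 1 gets 1/2, and class 2 gets 1, inverting the frequency imbalance.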
Class labelling: Center points (t1, t2) are predicted to fall into one of six classes: Class 0: t1 is not part of a gold Quantity (Q) span; Class 1: t1 is within a Q span but t2 is not within any gold span; Class 2: t1 and t2 are in the same Q span (e.g. 5 kg); Class 3: t1 is within a Q span and t2 is any token within the ME span belonging to the same annotation set as t1; Class 4: t1 is within a Q span and t2 is any token within the MP span belonging to the same annotation set as t1; Class 5: t1 is within a Q span and t2 is any token within the QL span belonging to the same annotation set as t1.
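The six-way scheme can be written down as a small decision function. This is a toy version under simplifying assumptions: `gold` holds the spans of a single annotation set as (start, end) token ranges (end exclusive), and span kinds are keyed by the names used above.

```python
def label_pair(t1, t2, gold):
    """Assign one of the six classes to a token pair (t1, t2).
    `gold` maps span kinds ("Q", "ME", "MP", "QL") of one annotation
    set to (start, end) token ranges; absent kinds are simply missing."""
    def inside(tok, kind):
        span = gold.get(kind)
        return span is not None and span[0] <= tok < span[1]

    if not inside(t1, "Q"):
        return 0                       # t1 not in any gold Quantity span
    if inside(t2, "Q"):
        return 2                       # both tokens in the same Q span
    for cls, kind in ((3, "ME"), (4, "MP"), (5, "QL")):
        if inside(t2, kind):
            return cls                 # t2 in the related ME/MP/QL span
    return 1                           # t2 in no gold span
```

For instance, with gold spans Q = (3, 5) and ME = (0, 2), the pair (3, 1) gets class 3 and the pair (3, 4) gets class 2.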

Phase-III: Postprocessing
The six classifier classes predict relations between two tokens, while the gold standard annotates relations between text spans. Mapping our predictions to the competition output format requires several postprocessing steps.
Prediction ranking: When multiple t2 with the same class are predicted for a particular t1, we choose the t2 with the highest probability.

Span mapping: For each prediction (t1, t2) ∈ Class Y, we record the token offset of t2 as the span for Class Y in our system output, unless the BERT model from Therien et al. (2021) finds a neighbouring token with the same predicted class, in which case the tokens are merged into the same span.
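The ranking step is a simple argmax per (t1, class) pair; a sketch under the assumption that classifier outputs arrive as (t1, t2, class, probability) tuples:

```python
def rank_predictions(preds):
    """For each (t1, class) pair, keep only the t2 with the highest
    predicted probability. `preds` is a list of (t1, t2, cls, prob)."""
    best = {}
    for t1, t2, cls, prob in preds:
        key = (t1, cls)
        if key not in best or prob > best[key][1]:
            best[key] = (t2, prob)
    return best

b = rank_predictions([(0, 3, 3, 0.4), (0, 5, 3, 0.7), (0, 2, 4, 0.6)])
```

In this toy run, token 5 wins the ME relation (class 3) for t1 = 0 because its probability (0.7) beats token 3's (0.4).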
Detecting units: Once quantity spans are determined, any measurement gazetteer entry within a span is labelled as a unit.
Predicting quantity modifiers: We used another pretrained BERT model from Therien et al. (2021) to predict modifiers for each quantity span.

Results and Analysis
We split the combined training and trial datasets randomly into a 75% training and a 25% validation set, resulting in 233 and 80 documents respectively.
We experimented with different preprocessing and feature representations using the official MeasEval evaluation script; the DGCNN training parameters were fixed for all experiments. Table 1 shows development results; the first row (in italics) shows the competition system. The first column (T) indicates the influence of the list and interval protection step: c indicates it is included, n that it is not. We observe that excluding it returns slightly better results, offset by a high rate of duplicates.
We experimented with different path length limits (column 4: H). While a length limit of 8 showed better results on the development data than our competition limit of 5, the same is not true for the test data (see Table 2), where there is no equivalent recall gain.

Table 2 shows competition and post-competition results on the test data. The initial competition system is shown in italics; in bold are the revised results after the organizers removed duplicates. Results on the test data are significantly lower, indicating overfitting. The post-competition ablation in Table 3 shows that UMAP, for instance, was not effective. The Overlap measure determined the competition rankings.

Analysing performance for the different labels in Table 4 shows that our system is not yet mature and needs adjusting. The potential of the DGCNN for the tasks is demonstrated by the high results for quantities (Q) and acceptable results for units (U), which are in line with stronger systems. The comparatively low performance for measured entities (ME) and measured properties (MP) demonstrates that the multi-class labelling approach needs better support. We will consider approaches from the literature (Yao et al., 2018; Sun et al., 2019; Hong et al., 2020; Gupta et al., 2016).
HasQuantity and HasProperty received no attention during development and consistently scored 0 for runs with UMAP, see Table 4.
Combined tokens: When a grouping CD (, CD)* and CD is followed by "respectively", each list item corresponds to a different entity/property (e.g. This compares to signatures of accelerated electron precipitation from peaked electrons and "inverted-V" electrons, which occur on 9.8 and 3.4% of MEX orbits, respectively.). Per the gold annotation, 9.8 and 3.4% are two different Qs, both with "MEX orbits" as ME and "signatures of accelerated electron precipitation" as MP. We generate 9.8 and 3.4% as a single Q and "signatures" as ME, leading to several false negatives. Documents S0032063312003054-2458, S0016236113008041-3257, and S0378112713005288-1916 caused this error.

Math equalities: Our system does not protect math environments such as ...tetragonal unit cell with a=4.1816(4)Å and c=10.0322(6)Å... in document S0022459611006116-1351. Consequently, a is labelled as a determiner (DT). As determiners are not part of gold labels, a is eliminated in postprocessing, when it should have been labelled as an MP in this case and also in document S0022459611006116-1257.

Duplication of Quantities
Lists of quantities including units are not combined by our custom tokenizer; thus, for 4.5 kg and 6 kg samples in document S0016236113008041-3257, we stipulate two different quantities. This results in duplicated quantities in our submission.

Units: The gold standard annotates, for instance, thin shale barriers (S1750583613004192-1126), das (S037842901300244X-1654), %∆E/E (S0301010413004096-767), and KLoC (S016412121300188X-3207) as units. These are not included in our gazetteer list of 4028 units, incurring false negative errors.

Conclusions
The MeasEval task is a challenging task that can benefit from a variety of tools in a well-integrated system. The small data size limits a true appreciation of the challenges involved, but our ablation studies suggest that tokenization variations influence precision and recall differently and should be carefully considered in application systems. Also, the UMAP reduction suggested a ca. 1% performance boost on validation data but incurred a ca. 5% loss on test data, showing signs of overfitting. The subgraph classifier proved effective only for quantity and unit prediction. Ablations show that larger subgraphs and longer paths lead to performance degradation, making a case for more task-oriented locality features. Duplicates have to be removed.
While the unusual complexity of the classification task and the limited size of the dataset prohibit very general conclusions, we showed that DGCNNs offer an interesting way to encode dependency information, but that they have to be supported by several domain-inspired contributions to work effectively for all task components.

Acknowledgments
We gratefully acknowledge Benjamin Thérien and Parsa Bagherzadeh for their help. The work was supported by NSERC.