DPR at SemEval-2021 Task 8: Dynamic Path Reasoning for Measurement Relation Extraction

Scientific documents are replete with measurements mentioned in various formats and styles. As such, in a document with multiple quantities and measured entities, associating each quantity with its corresponding measured entity is challenging, and a method to efficiently extract all measurements and their related attributes is necessary. To this end, in this paper, we propose a novel model for the task of measurement relation extraction (MRE), whose goal is to recognize the relations between measured entities, quantities, and conditions mentioned in a document. Our model employs a deep translation-based architecture to dynamically induce the important words in the document for classifying the relation between a pair of entities. Furthermore, we introduce a novel regularization technique based on the Information Bottleneck (IB) to filter out noisy information from the induced set of important words. Our experiments on the recent SemEval 2021 Task 8 dataset reveal the effectiveness of the proposed model.


Introduction
One of the key characteristics of scientific writing is the quantitative description of various experiments and results. While the mentions of all measurements could provide a rigorous understanding of the topic, they might make the reading and automatic processing of the text more difficult. As such, designing effective methods to recognize the mentions of measurements, as well as the conditions under which they are valid, is necessary. According to the definition of the SemEval 2021 Task 8 (Harper et al., 2021), a measurement might consist of the following components: (i) Measured Entity: a span referring to an entity one of whose properties has been measured, with its value provided in the document; (ii) Measured Property: a span referring to a characteristic of an entity that has been measured; (iii) Quantity: a span in the document that refers to a value, possibly accompanied by a unit; and (iv) Qualifier: a span referring to a condition that provides more information about the Quantity, Measured Property, or Measured Entity. Figure 1 shows a sample document annotated with the aforementioned entities. In this paper, we collectively refer to all four types as measurement components.
As shown in the provided example, documents might contain multiple entities, properties, quantities, and qualifiers that are scattered in different parts of the document. As such, finding which measurement components are associated with each other is not straightforward. In this paper, this task is called measurement relation extraction (MRE); it aims to recognize the relationship between two given measurement components. More specifically, the following relation types are considered: (i) Has-Property: indicates the selected property is one of the characteristics of the selected entity; (ii) Has-Quantity: indicates the selected quantity is provided for the selected entity or property; (iii) Qualifies: indicates the selected qualifier provides more information about the selected entity or quantity; (iv) None: indicates that there is no relation between the selected measurement components. For instance, in the example document in Figure 1, the following relations between measurement components exist: (1) ME1 has property PR1; (2) PR1 has quantity QT1; (3) ME2 has quantity QT2; (4) ME3 has quantity QT3; and (5) QL1 qualifies ME3. Finding the relation between a pair of measurement components is challenging and requires reasoning about the positions of the given entities and the context in which they are used. Generally, this task can be formulated as a typical Relation Extraction (RE) task, whose goal is to identify the semantic relation between two given named entity mentions. For RE, it has been shown that contextual information, such as the dependency path between the two given entities, is important. As such, in this paper, we also aim to exploit the contextual information for a pair of measurement components to predict the relation between them. To this end, the main question to answer is how to extract the contextual information that is helpful for this task.
One simple solution is to use the dependency path between the two measurement components. However, this might not be ideal for various reasons, such as the lack of a high-quality dependency parser designed specifically for the scientific domain, and the fact that the dependency tree is agnostic to the downstream task (i.e., MRE) and thus might not be effective for extracting important context. Therefore, in this paper, we propose a novel method to dynamically infer the important context for the MRE task. More specifically, we introduce a deep architecture to infer which words should be selected from the given document to form the important context from which the relation between the given measurement components can be inferred. The proposed deep architecture exploits a translation-based perspective to achieve this goal.
In addition, we propose a novel method to efficiently regularize the representations of the input words based on the inferred important context. In particular, our method is based on the Information Bottleneck (IB) theory, in which the inferred context is treated as an information bottleneck to exclude noisy information from the input document representation. We conduct extensive experiments on the SemEval 2021 Task 8 dataset, which reveal the effectiveness of the proposed model for the task of MRE.

Model
Task Definition: The input to the model is the document D = [w_1, w_2, . . . , w_n] consisting of n words, along with the positions of the two entities of interest, w_s and w_o, where s and o are the indices of the first (i.e., subject) and the second (i.e., object) entities, respectively. The input document is annotated with a label l from the set L = {hasQuantity, hasProperty, qualifies, None}. Our proposed model for this task consists of four major components: (1) Input Encoder: converts the input text into high-dimensional word vectors; (2) Dependency Path Reasoning: employs the word vector representations to extract a path between the two entity mentions in the given document; (3) Regularization: employs the extracted dependency path as an information bottleneck to filter out noisy information from the input document; (4) Prediction: finally, the regularized representation of the dependency path is used to make the final prediction. The rest of this section provides details for the aforementioned components.
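As a concrete illustration, one MRE instance under this task definition can be organized as follows. This is a sketch in Python; the field names and the example sentence are ours, not the official MeasEval data format.

```python
# A hypothetical MRE training instance; field names are illustrative only.
instance = {
    "document": ["The", "melting", "point", "of", "gallium", "is",
                 "29.76", "degrees", "Celsius", "."],
    "subject_index": 2,             # s: head of the Measured Property span
    "object_index": 6,              # o: the Quantity
    "subject_type": "MeasuredProperty",
    "object_type": "Quantity",
    "label": "hasQuantity",
}

# The label set L from the task definition
LABELS = ["hasQuantity", "hasProperty", "qualifies", "None"]
label_id = LABELS.index(instance["label"])
```

The model sees the whole document plus the two entity positions; the gold label is one of the four relation types.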

Input Encoder
To represent each word w_i in the input document D, we use the concatenation of the following components: Contextualized Embedding: We feed the input document D, i.e., [CLS] w_1 w_2 . . . w_n [SEP], to the pre-trained BERT base transformer and take the hidden states of the last layer of the BERT model, i.e., E = [e_1, e_2, . . . , e_n], as the contextualized word embeddings of the input document. Note that for words split into multiple word-pieces, we take the average of their word-piece embeddings obtained from the BERT model. Position Embedding: For each word w_i, we compute its distances to the subject w_s and the object w_o, i.e., i − s and i − o, respectively. The distances are represented using high-dimensional vectors e^s_i and e^o_i obtained from randomly initialized embedding tables. During training, the embedding tables are updated. Entity Type Embedding: The types of the two entities (i.e., Quantity, Measured-Entity, Measured-Property, and Qualifier) are represented using high-dimensional vectors obtained from randomly initialized embedding tables. The embedding tables are fine-tuned during training.
The concatenations of the aforementioned embedding vectors, i.e., X = [x_1, x_2, . . . , x_n], are used to represent the words of the input document. It is noteworthy that since the parameters of the pre-trained BERT base model are fixed during training, in order to tailor the contextualization of the word embeddings to this task, we feed the vectors X to a Bidirectional Long Short-Term Memory (BiLSTM) network and use the hidden states of the BiLSTM neurons, i.e., H = [h_1, h_2, . . . , h_n], as the final vector representations of the input document D. The vectors H will be used by the subsequent components.
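The construction of X can be sketched as follows. This is a minimal NumPy illustration in which random matrices stand in for the BERT outputs and the learned embedding tables; the dimensions, and the choice to attach type embeddings only at the two entity positions, are our assumptions, and the BiLSTM that produces H from X is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_bert, d_pos, d_type = 10, 768, 50, 50
s, o = 2, 6                                   # subject / object indices

E = rng.normal(size=(n, d_bert))              # stand-in for BERT embeddings
pos_table = rng.normal(size=(2 * n + 1, d_pos))   # distances in [-n, n]
type_table = rng.normal(size=(4, d_type))     # 4 measurement component types

# Relative distances i - s and i - o, shifted to valid table indices
dist_s = np.array([i - s for i in range(n)]) + n
dist_o = np.array([i - o for i in range(n)]) + n

# Type embeddings: attached only at entity positions (our assumption)
T = np.zeros((n, d_type))
T[s] = type_table[2]                          # e.g., Measured-Property
T[o] = type_table[0]                          # e.g., Quantity

# Per-word representation x_i = [e_i : e^s_i : e^o_i : type_i]
X = np.concatenate([E, pos_table[dist_s], pos_table[dist_o], T], axis=1)
```

Each row of X is the concatenated representation of one word, which would then be fed to the BiLSTM.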

Dependency Path Reasoning
To find the dependency path between the subject and the object entities, we employ a translation-based perspective. More specifically, given the vector representations of the subject entity, i.e., h_s, and the object entity, i.e., h_o, the dependency path should be represented by a vector P such that, using this vector, the subject representation h_s is transferred (i.e., translated) to the object representation h_o under the operation Φ. Formally, h_o = Φ(h_s, P). Using this definition, we can obtain the path representation P via the inverse operation, i.e., P = Φ^{-1}(h_s, h_o). After obtaining the path representation P, we compare it with the representations of the other words of the document D to assess their likelihood of being included in the dependency path. Concretely, the similarity between the vector h_i and the vector P could be used to estimate the probability of the word w_i being part of the dependency path. However, one limitation of this method is that the likelihood of the word w_i is computed regardless of the other words w_j where j ∉ {i, s, o}. To address this issue, we propose to compute the likelihood of the word w_i based on the interaction between the representation of the word w_i, i.e., h_i, the representations of the other words, i.e., h_j for j ∉ {i, s, o}, and the path representation P. To this end, we first compute a vector representation for the words w_j by applying the MAX_POOL operation over all words w_j for j ∉ {i, s, o}: h̄_{-i} = MAX_POOL({h_j | j ∉ {i, s, o}}). Afterwards, we apply the function Φ^{-1} to the vectors h̄_{-i} and P: ĥ_i = Φ^{-1}(h̄_{-i}, P). The vector ĥ_i represents the path for transferring (i.e., translating) the vector h̄_{-i} to P. As such, the similarity between ĥ_i and h_i could reveal how important the word w_i is for converting the representation of the context w_j for j ∉ {i, s, o} to the representation of the dependency path P.
Therefore, we use this similarity, i.e., Sim_i = ‖ĥ_i − h_i‖, as the score of the word w_i for inclusion in the dependency path. The words whose scores are above a pre-defined threshold are used as the inferred dependency path. It is worth noting that to learn the function Φ^{-1}, in this work, we use a feed-forward neural network. In particular, the concatenation of the vectors h_s and h_o is fed into a 2-layer feed-forward neural network with |P| neurons in the final layer: P = FF([h_s : h_o]), where [· : ·] represents concatenation and FF represents the feed-forward neural network. To train the FF network for the RE task, we use the vector P to predict the probability distribution P_Φ(·|D, t, a) using another feed-forward network FF_2 whose final-layer dimension equals the number of labels, i.e., |L|. We use the negative log-likelihood to train the FF and FF_2 networks: L_Φ = −log(P_Φ(l|D, t, a)), where l is the gold label.
Finally, to represent the induced path, we take the max-pooled representation of the words in the path: h_p = MAX_POOL(h_{i_1}, h_{i_2}, . . . , h_{i_p}), where i_1, . . . , i_p are the indices of the p words in the induced dependency path. The path representation h_p will be used by the subsequent components.
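The path induction described above can be sketched as follows. Random weights stand in for the learned Φ^{-1} network, the tuned threshold is replaced by the median score purely so the example is self-contained, and treating the norm ‖ĥ_i − h_i‖ as the score Sim_i is our reading of the scoring formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, hdim = 10, 8, 16
s, o = 2, 6
H = rng.normal(size=(n, d))                   # BiLSTM outputs h_1..h_n

# Phi^{-1}: a 2-layer feed-forward net over the concatenated input pair
W1 = rng.normal(size=(2 * d, hdim)) / np.sqrt(2 * d)
W2 = rng.normal(size=(hdim, d)) / np.sqrt(hdim)
def phi_inv(a, b):
    return np.tanh(np.concatenate([a, b]) @ W1) @ W2

P = phi_inv(H[s], H[o])                       # path representation P

scores = np.zeros(n)
for i in range(n):
    # MAX_POOL over all words w_j with j not in {i, s, o}
    ctx = np.stack([H[j] for j in range(n) if j not in (i, s, o)])
    h_bar = ctx.max(axis=0)
    h_hat = phi_inv(h_bar, P)                 # path transferring h_bar to P
    scores[i] = np.linalg.norm(h_hat - H[i])  # Sim_i

threshold = np.median(scores)                 # stand-in for the tuned 0.7
path = [i for i in range(n) if scores[i] > threshold]
h_p = H[path].max(axis=0)                     # max-pooled path representation
```

In the actual model Φ^{-1} is trained jointly via the auxiliary loss L_Φ, so the scores become informative rather than random.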

Regularization
Although the induced dependency path from the previous component is intended to contain the important information for the RE task, it might still contain some noisy information due to the contextualization in the input encoder. To overcome this noisy information, in this work, we propose to exploit the induced path as an information bottleneck (IB) (Tishby et al., 2000). IB's goal is to reduce the mutual information between the input and the bottleneck while increasing the mutual information between the bottleneck and the output. For the second goal, the bottleneck (i.e., the dependency path representation h_p) will be used by the prediction component, and the increase of its mutual information with the output is enforced by reducing the training loss (e.g., the negative log-likelihood). To fulfill the first goal, i.e., decreasing the mutual information between the input and the bottleneck, we resort to a contrastive learning paradigm that estimates the mutual information between two high-dimensional vectors using the classification loss of a binary discriminator. More specifically, the path representation h_p is concatenated with the max-pooled representation of the input document D, i.e., h_d = MAX_POOL(h_1, h_2, . . . , h_n), and this concatenation, i.e., h_pos = [h_p : h_d], serves as the positive sample for contrastive learning. To construct the negative samples, we first take the max-pooled representation of a randomly chosen document D′ from the same mini-batch, i.e., h_d′ = MAX_POOL(h′_1, h′_2, . . . , h′_m), where h′_i is the representation of the i-th word in the document D′ and m is the total number of words in D′. Afterwards, the concatenation of h_p and h_d′ is employed as the negative sample: h_neg = [h_p : h_d′]. Finally, a feed-forward discriminator D is employed and trained to distinguish the positive samples from the negative ones, i.e., L_disc = log(1 + e^{1−D(h_pos)}) + log(1 + e^{D(h_neg)}).
By adding the discriminator loss L_disc to the final loss function and decreasing it, the estimated mutual information between the input and the bottleneck (i.e., the path representation h_p) is decreased too.
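The construction of the positive and negative samples and the discriminator loss can be sketched as follows; a linear discriminator and random vectors are used purely for illustration (the paper uses a learned feed-forward discriminator over real representations).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
h_p = rng.normal(size=d)          # induced path representation
h_d = rng.normal(size=d)          # max-pooled rep. of the input document D
h_d_prime = rng.normal(size=d)    # max-pooled rep. of a random document D'

h_pos = np.concatenate([h_p, h_d])        # positive sample [h_p : h_d]
h_neg = np.concatenate([h_p, h_d_prime])  # negative sample [h_p : h_d']

w = rng.normal(size=2 * d)        # linear discriminator, for illustration
def disc(x):
    return float(w @ x)

# Softplus-style discriminator loss from the paper:
# L_disc = log(1 + e^{1 - D(h_pos)}) + log(1 + e^{D(h_neg)})
L_disc = np.log1p(np.exp(1 - disc(h_pos))) + np.log1p(np.exp(disc(h_neg)))
```

Minimizing this loss pushes the discriminator score up for the true (path, document) pair and down for the mismatched pair, which serves as the mutual-information estimate being regularized.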

Prediction
To make the final prediction on the relation between the given subject and object entities, we employ the representations of the induced dependency path (i.e., h_p), the subject entity (i.e., h_s), and the object entity (i.e., h_o) to construct the final prediction vector V = [h_p : h_s : h_o], where [· : ·] represents concatenation. The vector V is finally consumed by a feed-forward neural network to predict the distribution P(·|D, t, a). The loss function to train the main RE task is thus defined as: L_pred = −log(P(l|D, t, a)), where l is the gold label. The overall loss function to train the entire model is: L = L_pred + αL_Φ + βL_disc, where α and β are trade-off parameters.
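The composition of the overall loss can be illustrated with placeholder values; α and β follow the values reported in the hyper-parameter section, while the loss values themselves are made up for the example.

```python
import math

# Illustrative gold-label probabilities from the two prediction heads
p_pred, p_phi = 0.7, 0.6
L_pred = -math.log(p_pred)        # main RE loss, -log P(l|D,t,a)
L_phi = -math.log(p_phi)          # auxiliary path-head loss L_Phi
L_disc = 0.4                      # discriminator loss (illustrative value)

alpha, beta = 0.1, 0.05           # trade-off parameters from the paper
L = L_pred + alpha * L_phi + beta * L_disc
```

With small α and β, the gradient signal is dominated by the main RE loss while the path and IB terms act as regularizers.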

Dataset, Hyper-Parameters & Baselines
In order to demonstrate the effectiveness of the proposed model, i.e., Dynamic Path Reasoning (DPR), we evaluate it on the recent SemEval 2021 Task 8 dataset. This dataset provides measurement annotations for 233 training documents, 65 development documents, and 130 testing documents, all in English. Note that we conduct experiments only on the train and trial sets (as the gold entities are not available for the test set). Also, we evaluate the model only for relation extraction, not the entire task (as such, we did not make a submission during the MeasEval evaluation phase). More specifically, for each document, the positions of the measured entities, measured properties, quantities, and qualifiers are provided. Furthermore, for each measurement component, its relations with the other components and extra information (e.g., the unit of a quantity) are available. Note that in our experiments, we do not use the annotation set information, which indicates which components belong to the same measurement.
We fine-tune the hyper-parameters of the proposed model on the development set of the SemEval 2021 Task 8 dataset. The model with the best performance on the development set is evaluated on the test set. Based on our experiments, the following hyper-parameters are selected: 50 dimensions for the position embedding and entity type embedding; 200 dimensions for the hidden layers of the BiLSTM and all feed-forward networks; 0.1 and 0.05 for the trade-off parameters α and β; 0.7 for the threshold in the dynamic path reasoning component; the Adam optimizer with learning rate 0.3; batch size 50; and early stopping with a patience of 10.
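For reference, the selected values can be collected into a single configuration. This is a convenience sketch; the paper does not prescribe any implementation format, and the key names are ours.

```python
# Hyper-parameters reported in the paper, gathered in one place
CONFIG = {
    "position_emb_dim": 50,
    "entity_type_emb_dim": 50,
    "hidden_dim": 200,              # BiLSTM and feed-forward hidden layers
    "alpha": 0.1,                   # trade-off for L_Phi
    "beta": 0.05,                   # trade-off for L_disc
    "path_threshold": 0.7,          # dynamic path reasoning threshold
    "optimizer": "Adam",
    "learning_rate": 0.3,
    "batch_size": 50,
    "early_stopping_patience": 10,
}
```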
To comprehensively evaluate the proposed model, we compare its performance against the following baselines: (i) Sequential models: specifically, we compare with a BiLSTM that takes the non-contextualized word embeddings of the input document (i.e., GloVe) and encodes the sequence of words. Moreover, we also compare with a BERT model fine-tuned during training for the MRE task.
(ii) Structure-aware models: these models employ the structure of the input document (e.g., the dependency trees of its sentences). Specifically, we compare with iDepNN (Gupta et al., 2019), which employs the dependency trees of the sentences of the document. This baseline adds an edge between the roots of the trees to create a connected graph. Furthermore, it prunes the tree along the dependency path between the two entities of interest. Finally, we compare our model with LSR, which dynamically infers a graph structure for the input document using the representations of the entities and the other words on the dependency path between the entities.

Results
The results on the test set are presented in Table 1. There are several observations from this table. First, the proposed model significantly (with p < 0.01) outperforms the baselines, which indicates the importance of the dynamic path reasoning and the proposed regularization method. Second, comparing the structure-aware and sequence-based baselines, it is evident that the structure of the input document is necessary for achieving better results. However, between the iDepNN and LSR baselines, the latter performs better due to its capability of inferring the structure of the document instead of relying on external parse trees, as iDepNN does. Finally, this experiment shows that using the pre-trained language model BERT substantially improves the performance compared to a sequence-based model that utilizes GloVe embeddings. This is on par with the recent advancements in NLP using contextualized word embeddings.

Ablation Study
In this section, we provide more insight into the effectiveness of the different components of the proposed model. The two major components of our model are dynamic path reasoning and regularization. To study their importance, we evaluate the performance of the following baselines on the development set of the SemEval 2021 Task 8 dataset: (i) Full −DPR: this baseline completely removes the dynamic path reasoning component; more specifically, the vector h_p is removed from the final prediction vector V and the loss L_Φ is removed from the overall loss function L; (ii) a baseline that scores each word using only the similarity between h_i and the path representation P, ignoring the other context words; (iii) a baseline that removes the regularization component (i.e., the loss L_disc); and (iv) a baseline that replaces the IB-based regularization with a dot product between h_p and h_d. The results show that all components are necessary to achieve the highest performance. More specifically, the dynamic path reasoning has the highest impact on the performance, as removing it hurts the most. The results also show that considering the context when computing the score of each word for inclusion in the induced path is necessary. Finally, they show that the regularization is helpful for excluding noisy information from the input. More interestingly, replacing the IB with a dot product to enforce the regularization hurts more than removing the regularization itself, which indicates the necessity of using IB for regularization.

Conclusion
We proposed a new model for the MRE task. The introduced model employs a dynamic path reasoning component that induces important context words to predict the relation between two measurement components. Furthermore, we proposed a novel regularization method based on the Information Bottleneck to exclude noisy information from the input. Our experiments on the SemEval 2021 Task 8 dataset reveal the effectiveness of the proposed model.