LIORI at SemEval-2021 Task 8: Ask Transformer for measurements

This work describes our approach to the subtasks of SemEval-2021 Task 8: MeasEval: Counts and Measurements, which took the official first place in the competition. To solve all subtasks we use multi-task learning in a question-answering-like manner. We also use learnable scalar weights to weight each subtask's contribution to the final loss in multi-task training. We fine-tune LUKE to extract quantity spans, and we fine-tune RoBERTa to extract everything related to the found quantities, including the quantities themselves.


Introduction
SemEval-2021 Task 8 consisted of five subtasks that covered span extraction, classification, and relation extraction. This paper presents solutions to all five of them, which showed the best results in the competition.¹
In subtask 1 (A), participants were asked to retrieve Quantity (Q) spans from texts. For example, given the text "The soda can's volume was 355 ml.", the system should retrieve "355 ml" as the Q span. The rest of the subtasks were to extract information related to the Quantities (Qs) retrieved in subtask A.
Subtask 2 (B) was to extract the Unit of measurement (UoM) of the extracted Q and also to classify the Q into 10 types: HasTolerance, IsApproximate, IsCount, IsList, IsMean, IsMeanHasTolerance, IsMeanIsRange, IsMedian, IsRange, IsRangeHasTolerance. It should be noted that some Qs could belong to more than one type, while others belonged to none. Subtask 3 (C) was to extract Measured Entity (ME) and Measured Property (MP) spans. In subtask 4 (D), additional Qualifier (Qlfr) spans, which help to validate or understand the extracted Q, were to be extracted. Finally, subtask 5 (E) was to extract relations between Qs, MEs, MPs, and Qlfrs.
¹ https://github.com/davletov-aa/meas-eval
More detailed information about the competition can be found in the shared task description paper by Harper et al. (2021).

Related Work
Span extraction and classification problems have a long history of study and are often addressed as part of Named Entity Recognition (NER). For example, the NER dataset Ontonotes v5 (Weischedel et al., 2013) contains such entities as "Quantity", which also includes measurements, and "Money". However, the general NER annotation used in the Ontonotes or CoNLL 2003 (Sang and De Meulder, 2003) datasets is not as fine-grained as the one used in the task under study.
Most state-of-the-art models for named entity recognition and relation extraction are based on the Transformer architecture (Vaswani et al., 2017). For example, the top three models for Ontonotes v5 according to paperswithcode.com use BERT, a large pre-trained language model based on attention (Devlin et al., 2019). BERT has a distinctive training procedure: the model is trained with a masked language modeling objective, where some tokens are replaced with a special '[MASK]' token and the model must predict the original tokens. BERT also had an additional training objective: the model had to predict whether a second sentence was random or actually followed the first. However, later work has shown that BERT is undertrained and that training it on more data and for longer can increase model performance. RoBERTa was one of the first and most influential works of this kind (Liu et al., 2019). RoBERTa modifies BERT's pretraining procedure: the model is trained longer, with bigger batches, over more data, and on longer sequences. RoBERTa's authors also found that removing the next sentence prediction objective matches or slightly improves BERT's performance. Researchers have also suggested ways of leveraging the nature of the task and adding problem-specific bias to named entity recognition. Among such works, and currently the best performing for Ontonotes v5 and CoNLL 2003 according to paperswithcode.com, is LUKE (Yamada et al., 2020). The authors of LUKE added a new language modeling task that consists of predicting randomly masked words and entities in an entity-annotated corpus retrieved from Wikipedia. They also extended the self-attention mechanism to take entity types into account when computing attention scores.
The proposed approach allowed the authors to achieve state-of-the-art results not only for named entity recognition but also for a number of other, unrelated tasks such as SQuAD 1.1 question answering.
For relation extraction, Transformer-based models also outperform other approaches. A promising direction is treating relation extraction as a question answering problem. Among works implementing this approach is Cohen et al. (2020), where the authors recast relation classification as a question answering (QA)-style span prediction problem, which allowed them to achieve state-of-the-art results on the TACRED and SemEval 2010 Task 8 datasets.

Data
The data provided by the organizers contained plain text files and their annotations in TSV format. In total there were 248 training texts, 65 trial texts, and 135 texts for the evaluation phase, with 2764, 897, and 1620 annotated entities respectively. The files were approximately equally distributed among several domains: Agriculture, Astronomy, Biology, Chemistry, Computer Science, Earth Science, Engineering, Materials Science, Mathematics, and Medicine. Entities were labeled with one of five classes: Q, ME, MP, UoM, or Qlfr.
As the input data in the competition came as plain text extracts, we first split them into sentences using PunktSentenceTokenizer and PunktTrainer from the NLTK library (Bird et al., 2009), training PunktTrainer on texts from the training set. We augmented the data by adding, for each text document, extracts consisting of two consecutive sentences. So if a document had original sentences [s1, s2, s3, s4], we get the augmented set of texts [s1, s2, s3, s4, s1 s2, s2 s3, s3 s4]. Then we split each example into tokens using RegexpTokenizer from the NLTK library with the regular expression \w+|\(|\)|\[|\]|[-{.,]|\S+. We used the train set for training and the trial set for development.
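The two-sentence augmentation step can be sketched as follows (the function name is ours, not from the released code; sentence splitting with NLTK Punkt is assumed to have been done already):

```python
def augment_with_pairs(sentences):
    """Given a document's sentences, add each pair of adjacent
    sentences as an extra two-sentence text extract."""
    pairs = [f"{a} {b}" for a, b in zip(sentences, sentences[1:])]
    return sentences + pairs

# For a document [s1, s2, s3, s4] this yields
# [s1, s2, s3, s4, s1 s2, s2 s3, s3 s4], as in the text.
print(augment_with_pairs(["s1", "s2", "s3", "s4"]))
```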
We also relabeled Qualifier into QuantityQualifier (QQ), MeasuredEntityQualifier (MEQ), and MeasuredPropertyQualifier (MPQ). This little trick solved the problem of examples having multiple Qlfrs corresponding to either Q, ME, or MP.
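A minimal sketch of this relabeling, assuming each Qualifier annotation records which entity type it attaches to (the dictionary and function names are illustrative, not from the released code):

```python
# Map the entity a Qualifier relates to onto its new, disambiguated label.
QUALIFIER_RELABEL = {
    "Quantity": "QuantityQualifier",            # QQ
    "MeasuredEntity": "MeasuredEntityQualifier",    # MEQ
    "MeasuredProperty": "MeasuredPropertyQualifier",  # MPQ
}

def relabel(label, related_to):
    """Replace a generic 'Qualifier' label with a type-specific one;
    leave all other labels untouched."""
    if label == "Qualifier":
        return QUALIFIER_RELABEL[related_to]
    return label
```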

QuAnt System
The architecture of QuAnt is shown in Figure 2. As can be seen, our model uses RoBERTa to extract features for each example and solves all subtasks of the competition in a multi-task, question-answering way. We ask the model to predict the BPE subword-level start and end positions of all spans (answers) related to a given Q (question). We pose the question by inserting the special tokens "•" and "/" around the Q. The model also makes multi-class, multi-label predictions of the type of the Quantity (QT).
As input, the model takes a text extract containing some Qs together with the position of the Q for which it should make predictions.
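The marking scheme can be illustrated at the token level (a simplified sketch; the actual system inserts the markers into BPE subword sequences):

```python
def mark_quantity(tokens, q_start, q_end):
    """Insert the special markers around the Quantity span,
    given as a half-open token range [q_start, q_end)."""
    return (tokens[:q_start] + ["•"]
            + tokens[q_start:q_end] + ["/"]
            + tokens[q_end:])

# Marking "355 ml" in the running example from the introduction:
print(mark_quantity(["volume", "was", "355", "ml", "."], 2, 4))
```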

Extract Quantities
Our approach therefore needs quantity span information as input. To get it, we fine-tuned the LUKE model (Yamada et al., 2020) on the NER task to predict Q spans, using the code provided by the model's authors. We trained it on the augmented dataset in BIO format with the following hyperparameters: maximum entity length, maximum sequence length, learning rate, and batch size set to 64, 256, 1e-5, and 4 respectively. We trained two models, with the weight decay hyperparameter equal to 0.1 and 0.01, for 10 epochs each. We validated the models four times per epoch on the development set and saved the top 3 checkpoints of each run.
So after two training runs we had 6 trained models. Using the development set, we chose the best combination of them for a simple word-level voting ensemble.

Extract Everything
During training and validation we use Q spans from the annotated set; during test prediction we use spans predicted by the ensemble of quantity extractors from the previous section. Thanks to our test time augmentation, we were able to get up to three entries for each Q: one for the sentence containing it, and one each for that sentence combined with its left or right context sentence.
We split tokenized examples into byte-pair-encoding (BPE) subwords with RoBERTaTokenizer, which resulted in RoBERTa inputs with Qs marked by the symbols "•" and "/". To vectorize the QT, we use the last-layer output corresponding to the [CLS] token, and to predict start and end probabilities for each subword of each span type we use the remaining last-layer outputs. These are fed to linear layers that predict QTs and span starts and ends.
During training we optimize the following weighted loss:

$$\mathcal{L} = \frac{1}{bs}\sum_{i=1}^{bs}\Big[w_{QT}\,\ell\big(QT_i, \widehat{QT}_i\big) + \sum_{ST \in STs} w_{ST}\,\big(\ell\big(ST_i^{start}, \widehat{ST}_i^{start}\big) + \ell\big(ST_i^{end}, \widehat{ST}_i^{end}\big)\big)\Big]$$

where $bs$ is the batch size; $[w_{QT}; w_{ST} \mid ST \in STs] = \mathrm{softmax}([w_{qt}; w_{st} \mid st \in STs])$ is a learnable weight vector initialized with ones; $QT_i$ is the one-hot encoded QT of the $i$-th example (which could be a zero vector in some cases and then does not contribute during training); $\widehat{QT}_i$ is the predicted QT probability distribution; $STs$ is the set of span types [Q, ME, MP, Qlfr, UoM, QQ, MEQ, MPQ]; $ST_i^{start}$ is the one-hot encoded start position of the corresponding span and $\widehat{ST}_i^{start}$ its predicted probability distribution, and likewise for $ST_i^{end}$; and $\ell$ is the per-example cross-entropy loss. We trained our model without adding the optional question prefix to the RoBERTa inputs, using the hyperparameters from Table 1.
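The softmax normalization of the learnable subtask weights can be illustrated with plain floats (in the actual model the raw weights are trainable parameters initialized with ones and receive gradients; function names are ours):

```python
import math

def softmax(ws):
    """Numerically stable softmax over a list of raw weights."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_multitask_loss(task_losses, raw_weights):
    """Combine per-subtask losses using softmax-normalized weights,
    as in the weighted loss above."""
    weights = softmax(raw_weights)
    return sum(w * loss for w, loss in zip(weights, task_losses))

# With all raw weights initialized to one, every subtask contributes
# equally, so three losses of 1.0, 2.0, 3.0 combine to 2.0.
print(weighted_multitask_loss([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```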
As our test predictions include duplicate predictions for the same Q due to test time augmentation, we remove identical predictions. It is worth noting that duplicates might still remain when different values were extracted for the same Quantity. Because of this, our submitted results are higher than the results without test time data augmentation. So, our model takes a Q with its context as input, predicts the Q's type, and extracts the various spans. For all subtasks except subtask E we used the extracted answers as is. For subtask E we used the following rules to extract relations between Q, MP, ME, and Qlfr (QQ, MEQ, MPQ):

Experiments and Results
In this section, we report the results of our post-evaluation experiments. First, we experimented with base models and tried different subtask weighting strategies. As we solve the task in a multi-task way, we need to aggregate the losses of the subtasks into the final loss to optimize. Here we tried simply averaging them (equal) or taking a weighted sum using a learnable weight vector W (softmax, rsqr+log) with length equal to the number of training subtasks. In the softmax weighting strategy we simply apply softmax over the vector W. In the rsqr+log strategy, we divide each subtask's loss by its squared learnable weight and add the logarithm of that weight. This approach to weighting subtasks in multi-task learning was introduced by Kendall et al. (2018).
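The rsqr+log strategy can be sketched as follows (the exact parameterization and constant factors in our implementation may differ; this follows the general form of the uncertainty-based weighting of Kendall et al. (2018)):

```python
import math

def rsqr_log_loss(task_losses, weights):
    """Divide each subtask loss by the squared learnable weight and
    add the logarithm of the weight, which regularizes the weights
    away from growing unboundedly."""
    return sum(loss / (w ** 2) + math.log(w)
               for loss, w in zip(task_losses, weights))

# With a weight of 1.0, a subtask's loss passes through unchanged
# (1.0 / 1.0 + log 1.0 == 1.0).
print(rsqr_log_loss([1.0], [1.0]))
```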
We also experimented with data augmentation, but unlike the experiments we ran during the evaluation period, here we did not apply test time data augmentation.
We also tried concatenating a question prefix to the input example, experimenting with the prefix "Find measured entities and properties of marked quantity." in the hope that it would give the model extra information about the nature of the answer. Table 2 shows the best results on the development dataset and Table 3 shows the corresponding results on the test dataset, alongside our official submission results. Table 2 shows that training time data augmentation improves the overall score. We can also see that including the question prefix did not improve the overall scores of the models that use data augmentation, yet we see the opposite picture for the test set in Table 3. It can also be seen that RoBERTa-large does not necessarily outperform RoBERTa-base.
We find that a simple average of the subtasks' losses demonstrates the best results.
We also tried fine-tuning the large version of XLM-R with the best hyperparameters found for the base models. For the large models, again, the equal weighting scheme demonstrated the best result. In all our post-evaluation experiments we used the same settings as in Table 1, trying learning rates from {5e-5, 1e-4, 2e-4} and batch sizes from {32, 64, 128}.

Conclusion
In this paper, we introduced our solution to SemEval-2021 Task 8: MeasEval: Counts and Measurements. Our approach is based on the RoBERTa and LUKE models. We show that extracting measurements from text can be treated as a question-answering task. We experimented with a set of different models, hyperparameters, and weighting schemes and presented their effect on the final result.