Masked Reasoner at SemEval-2020 Task 4: Fine-Tuning RoBERTa for Commonsense Reasoning

This paper describes the masked reasoner system that participated in SemEval-2020 Task 4: Commonsense Validation and Explanation. The system participated in Subtask B. We propose a novel method to fine-tune RoBERTa by masking the most important word in the statement. We believe that the confidence of the system in recovering that word is positively correlated with the score the masked language model gives to the current statement-explanation pair. We evaluate the importance of each word using InferSent and perform masked fine-tuning on RoBERTa. Then we use the fine-tuned model to predict the most plausible explanation. Our system is fast to train and achieved 73.5% accuracy.


Introduction
Commonsense validation and explanation is a critical area in natural language understanding. Recent research advances (Yang et al., 2019; Devlin et al., 2018) pushed the bar in this area to a new height. SemEval-2020 Task 4 (Wang et al., 2020) is focused on this area. It has three subtasks. Subtask A is focused on commonsense validation. It gives two statements, one of which is against commonsense. For example:
1) He poured orange juice on his cereal.
2) He poured milk on his cereal.
The system needs to determine which one is against commonsense. Subtask B is the one we participated in. It first gives a statement that is wrong or improper, then lists three explanations, one of which explains why the statement is improper. For example:
Statement: He poured orange juice on his cereal.
Explanations:
1) Orange juice is usually bright orange.
2) Orange juice doesn't taste good on cereal.
3) Orange juice is sticky if you spill it on the table.
The system needs to select the correct explanation out of the three. Subtask C, in our opinion, is more challenging. The system needs to generate an explanation of why a statement is against commonsense. For example:
Statement: He put an elephant into the fridge.
Possible valid reasons:
1) An elephant is much bigger than a fridge.
2) A fridge is much smaller than an elephant.
3) Most fridges are not large enough to contain an elephant.
BLEU (Papineni et al., 2002) is used to judge whether a generated reason is valid. Subtask B is very similar to other well-known datasets, such as CommonsenseQA (Talmor et al., 2018) and COPA (Roemmele et al., 2011). Much previous research (Talmor et al., 2018; Lin et al., 2019) showed that pre-trained language models implicitly hold knowledge and are thus able to answer such questions well. By fine-tuning these pre-trained models, many datasets reached new state-of-the-art results. Our system is fine-tuned in a novel way. We argue that whether an explanation fits a statement is strongly correlated with the 'core' of the statement, which should be its most important word. We first use InferSent to locate the most important word of the statement. Then we mask that word and fine-tune RoBERTa. By concatenating the proper explanation to the statement with a few connecting words such as ' is wrong because ', we obtain a full-text format. After fine-tuning, we apply the model to the test set. We provide our code on GitHub for reproducibility: https://github.com/daming-lu/Code-for-SemEval2020-Task4.
System Description

Problem Abstraction
Suppose we have an input statement s = w^(1), w^(2), ..., w^(L_s), where each w^(i) is a word and L_s is the length of the statement. We also have a set of candidate explanations E = {e_1, e_2, e_3}, from which we aim to identify the most proper explanation e* ∈ E that explains why statement s is improper. The candidate explanations can have various lengths.
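Abstractly, the task reduces to taking the argmax of a scoring function over the candidate explanations. A minimal sketch of that selection step (the `overlap_score` toy scorer here is ours for illustration only; the paper's actual scorer is the masked language model described in Sentence Evaluation):

```python
import re

def select_explanation(statement, explanations, score):
    """Return e* = argmax_i score(s, e_i): the candidate explanation the
    scorer judges most plausible for why the statement is improper."""
    return max(explanations, key=lambda e: score(statement, e))

# Toy stand-in scorer for illustration only: count shared words.
# The paper's real scorer is a masked-language-model probability.
def overlap_score(s, e):
    words = lambda t: set(re.findall(r"[a-z']+", t.lower()))
    return len(words(s) & words(e))

s = "He poured orange juice on his cereal"
E = ["Orange juice is usually bright orange.",
     "Orange juice doesn't taste good on cereal.",
     "Orange juice is sticky if you spill it on the table."]
best = select_explanation(s, E, overlap_score)
```

With the toy scorer, the second explanation wins because it shares the most words ("orange", "juice", "on", "cereal") with the statement.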

Core Word
In InferSent (Conneau et al., 2017), the authors focus on obtaining sentence embeddings that carry semantic information useful for natural language inference. We find that we can locate the most important word, a.k.a. the 'core' word of the statement, through its sentence representation. We use both the statements and the candidate explanations as the corpus to build a relatively small vocabulary. The semantic information implicitly stored in InferSent helps pinpoint the core word in the statement. See Figure 1 for more details.
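One way to realize this idea is a leave-one-out probe: embed the statement, re-embed it with each word removed, and take the word whose removal moves the embedding the farthest. The sketch below substitutes a bag-of-words IDF embedding for InferSent; the embedding, distance, and tiny corpus are all illustrative assumptions, so the toy run's pick need not match what InferSent would select.

```python
import math
from collections import Counter

def core_word_index(statement, embed, distance):
    """Leave-one-out importance: the 'core' word is the one whose
    removal changes the sentence embedding the most."""
    words = statement.split()
    base = embed(" ".join(words))
    def drop_cost(i):
        return distance(base, embed(" ".join(words[:i] + words[i + 1:])))
    return max(range(len(words)), key=drop_cost)

# Toy corpus: the statement plus its candidate explanations, mirroring
# the paper's small-vocabulary setup.
corpus = [
    "he poured orange juice on his cereal",
    "orange juice is usually bright orange",
    "orange juice doesn't taste good on cereal",
    "orange juice is sticky if you spill it on the table",
]
df = Counter(w for doc in corpus for w in set(doc.split()))

def embed(sentence):
    # Bag-of-words weighted by inverse document frequency (stand-in only).
    return Counter({w: math.log(len(corpus) / df[w])
                    for w in sentence.lower().split() if w in df})

def l1(a, b):
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))

idx = core_word_index("he poured orange juice on his cereal", embed, l1)
```

The real system swaps `embed` for InferSent encodings; the leave-one-out loop itself is unchanged.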

Sentence Evaluation
For each pair (s, e_i), we first build its full-text format by simply adding ' is wrong because ' to concatenate s and e_i. For example, the 'cereal' example mentioned in the Introduction looks like: He poured orange juice on his cereal is wrong because orange juice doesn't taste good on cereal.
We then mask the core word, i.e. 'cereal', in the statement and train the model to recover the masked word. Intuitively, the higher the confidence the model has in recovering the masked word, the more plausibly the explanation e_i explains the statement s. We denote by Sen the concatenated full-text sentence. Following the notation in (Song et al., 2019), we write Sen\w for the sentence Sen with the word w replaced by the [MASK] token. The sentence evaluation formula is:

Score(s, e_i) = P(w^(k) | Sen\w^(k)),

where k is the index of the core word. The masked word probability is estimated by a direct calculation on the pre-trained masked language model, in our case RoBERTa. See Figure 2 for more information.
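The evaluation step can be sketched as follows. Here `masked_prob` is a placeholder for the RoBERTa forward pass that returns the probability of the masked core word; the stub below only imitates its interface for illustration and is an assumption of ours, not the paper's model.

```python
def evaluate_pair(statement, explanation, core_idx, masked_prob):
    """Build the full-text sentence 's is wrong because e', then score
    the pair by the model's probability of recovering the core word."""
    tokens = f"{statement} is wrong because {explanation}".split()
    return masked_prob(tokens, core_idx)

def pick_explanation(statement, explanations, core_idx, masked_prob):
    """e* = argmax over candidates of the masked-word probability."""
    return max(explanations,
               key=lambda e: evaluate_pair(statement, e, core_idx, masked_prob))

# Stub in place of the real masked-LM call (illustrative assumption):
# recovery is treated as easier when the masked word reappears elsewhere.
def stub_masked_prob(tokens, k):
    clean = [t.strip(".").lower() for t in tokens]
    return 0.9 if clean[k] in clean[:k] + clean[k + 1:] else 0.1

s = "He poured orange juice on his cereal"
E = ["Orange juice is usually bright orange.",
     "Orange juice doesn't taste good on cereal.",
     "Orange juice is sticky if you spill it on the table."]
best = pick_explanation(s, E, 6, stub_masked_prob)  # core word 'cereal'
```

In the full system, `stub_masked_prob` is replaced by a RoBERTa forward pass over the [MASK]-ed full-text sentence.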

Data
Trial data were given out in the practice period. For Subtask B, the trial data has 2,021 rows. Then 10,000 training rows were released, each with 1 improper statement and 3 candidate explanations. Following the concatenation method mentioned above, we obtain 30,000 training examples for our system, of which 10,000 are positive and 20,000 are negative. We have 997 dev rows that can be used as test data during the practice period. Trial, training, and dev data all have gold answers. In the evaluation period, the real test data were released, whose gold answers are never revealed. See Table 1 for more information.

During training, our system reached 83.21% accuracy after 3 epochs, 85.22% after 6 epochs, and 87.43% after 10 epochs. We trained on 2 Nvidia GeForce GTX 1080 Ti GPUs. We tried both RoBERTa-base and RoBERTa-large, with hidden size 768 for RoBERTa-base and 1024 for RoBERTa-large. We used a batch size of 4 and a maximum learning rate of 1e-5. The hidden-state dropout rate is 5%. One epoch takes about 10 minutes. More details are available in our GitHub repo. We got 73.5% accuracy on the test dataset, which ranks us 21st out of the 30 teams.
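The expansion from 10,000 rows into 30,000 labeled examples (1 positive, 2 negative per row) can be sketched as below; the row layout `(statement, explanations, gold_index)` is our illustrative assumption about the data format, not the official schema.

```python
def build_examples(rows):
    """Expand each (statement, [e1, e2, e3], gold_index) row into three
    full-text examples: the gold explanation is positive, the rest negative."""
    examples = []
    for statement, explanations, gold in rows:
        for i, e in enumerate(explanations):
            text = f"{statement} is wrong because {e}"
            examples.append((text, int(i == gold)))
    return examples

rows = [("He poured orange juice on his cereal",
         ["Orange juice is usually bright orange.",
          "Orange juice doesn't taste good on cereal.",
          "Orange juice is sticky if you spill it on the table."],
         1)]
examples = build_examples(rows)
```

Applied to all 10,000 training rows, this yields the 30,000 examples (10,000 positive, 20,000 negative) described above.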

Discussion
There is a big drop in accuracy between training and testing. We suspect it is due to over-fitting: our system gains very little after 3 epochs, so we should have stopped earlier to keep the system more general. Even setting over-fitting aside, our best accuracy of 87.43% still cannot enter the top 10 on the leaderboard. We suspect this is because we did not use any external knowledge base; a large one such as ConceptNet (Speer et al., 2017) could help a lot.

Conclusion
We proposed a novel method for fine-tuning a pre-trained model and applied it to Subtask B. The combination of InferSent and RoBERTa makes the masked training faster. Meanwhile, (Tamborrino et al., 2020) introduced a more thorough approach that masks every word as well as N-grams in both the statement and the explanation. We think that only the important words need masking; trivial words such as 'he', 'the', and 'to' could be noise.