Ferryman at SemEval-2020 Task 5: Optimized BERT for Detecting Counterfactuals

The main purpose of this article is to describe the effect of using different methods and models for counterfactual detection and the extraction of causal knowledge. Counterfactual reasoning is now widely used in various fields. In natural language processing (NLP), it has great potential for assessing the correctness of a sentence. For the shared Task 5 on counterfactual detection in SemEval-2020, we pre-process the officially released dataset with case conversion, stem extraction, and abbreviation replacement. We use the last-5-[CLS] representation from Bidirectional Encoder Representations from Transformers (BERT) and a term frequency-inverse document frequency (TF-IDF) vectorizer for counterfactual detection. Meanwhile, multi-sample dropout and cross-validation are used to improve versatility and to prevent problems such as poor generalization caused by overfitting. Our team Ferryman ranked 8th in Sub-task 1 of this competition.


Introduction
A counterfactual is a conditional statement in which the first clause is a past-tense subjunctive expressing something contrary to fact, as in "If I had studied harder, I might have passed the exam". It can be used to reason about what might have happened if past events had occurred differently (Gentner and Yeh, 2005), as in "If Rose had not accepted the doctor's advice to remove the tumor, she might have been dead": Rose is still alive, and it is impossible to go back and have her reject the advice. In other cases, as in "If a drop in crude oil prices had been factored out, the stock price would not be so low", the statement is not claiming counterfactually that the drop in crude oil prices, more than anything else, caused the stock market to plummet. Judging counterfactuals is hard because they involve many aspects of life, such as economics and medicine. Besides, exploring the underlying causation is necessary.
Reasoning is an important and challenging task in the field of natural language processing; it aims to analyze the internal causality of a sentence to find out whether the sentence is correct. In a causal relationship, the cause is partly responsible for the result, while the result partly depends on the cause. There are inherent causal relationships among objective things, and by grasping them people can understand the nature of things comprehensively. As an important research direction in causal reasoning, counterfactual reasoning is involved in many cognitive processes.
Counterfactual thinking can be divided into upward and downward counterfactual thinking. The former assumes that, for something that happened in the past, a better outcome than the real one would have been possible if certain conditions had been met, for example, "If we had been on the pitch before the game, we would not have lost the game". The latter imagines an alternative result worse than the real one, such as "Fortunately, I went to the field for adaptation training before the game, otherwise I would have lost the game today". Through counterfactual reasoning, people not only experience emotions such as the joy of success and the regret of failure; more importantly, they can build confidence for future decisions and trace wrong decisions to their source so as to correct them. In addition, such counterfactual reasoning experience also offers an important reference for others in similar situations.
Many researchers have put in effort and come up with a series of methods to explore causation. Uplift modelling and causal trees have been applied (Radcliffe and Surry, 2011; Rzepakowski and Jaroszewicz, 2012; Zhao et al., 2017; Athey and Imbens, 2015; Athey and Imbens, 2016; Tran and Zheleva, 2019).
In this paper, we propose a model combining Bidirectional Encoder Representations from Transformers (BERT) with multi-sample dropout. After adopting a term frequency-inverse document frequency (TF-IDF) vectorizer, the model performs better. We compare different models and obtain the best F1 score by combining BERT with multi-sample dropout, a TF-IDF vectorizer ensemble, and the last-5-[CLS] token. Our model is grounded in real situations, so counterfactuals can be contrasted with them and identified when they occur. Multi-sample dropout helps accelerate the training process, and the TF-IDF vectorizer with linear regression improves the robustness of the model.
The rest of the paper is organised as follows. Section 2 gives an overview of work related to counterfactuals. The details of our model are explained in Section 3, and results are shown in Section 4. We summarize the whole approach in Section 5.

Related Work
There are many approaches to such reasoning, which mainly fall into two groups: statistical learning and deep learning. The most basic assumption in statistical learning theory is that training data and test data come from the same distribution. In most practical cases, however, the test data are drawn from a distribution that is only related to, but not identical with, the distribution of the training data, which poses a big challenge to reasoning. Moreover, counterfactual distributions tend to differ from factual distributions. Johansson (2016) reformulates the counterfactual problem as one of covariate shift, and later as a domain adaptation problem, because the factual distribution and the counterfactual outcome distribution are not the same. The following points are also considered: a) minimizing the error rate on factual outcomes; b) using relevant factual results to guide counterfactual results by constraining the outcomes of similar interventions; c) making the distributions of interventions similar by minimizing the discrepancy distance. Here the discrepancy distance refers to the difference between the factual distribution and the counterfactual distribution in the representation space. Later, Shalit (2017) extends the discrepancy distance to the joint distribution. The new model as a whole is similar to the previous one but overcomes the following limitations: a) the need for a two-step optimization and for linear hypotheses on the learned representation, and the lack of support for deep neural networks; b) the loss of the treatment indicator in a high-dimensional learned representation. However, a weight compensating for the difference in treatment group sizes in the sample is given.
Hassanpour (2019) makes this weight learnable and combines representation learning with re-weighting. Representation learning is used to minimize selection bias and to keep the predicted factual outcomes as correct as possible, while re-weighting adjusts the sample weights to make the distributions of observed data and counterfactual data as consistent as possible.
In contrast to previous domain adaptation approaches, Alaa (2017) treats counterfactual reasoning as a multi-task framework that mitigates selection bias through dropout based on a bias score: in each iteration the dropout probability depends on the bias score. Their model uses a deep multi-task network with a series of layers to model the potential (factual and counterfactual) outcomes.

Task Description
Task 5 mainly consists of two sub-tasks. Sub-task 1 is detecting counterfactual statements. According to the official definition, counterfactuals refer to things that did not actually happen or could not happen. In Sub-task 1, we need to determine whether each statement in the official dataset is counterfactual, for example: "If you prescribe a combination of paroxetine and exposure therapy two months ago, you can avoid her post-traumatic stress". Sub-task 2 is to locate the cause and effect in counterfactual statements, for example: "Because she did not avoid her post-traumatic stress, (we know) no combination of paroxetine and exposure therapy". Some statements contain only the first part without the latter part; they are incomplete, and we need to assign "-1" to the corresponding index (Yang et al., 2020). The rest of the paper focuses on Sub-task 1.

Data Pre-processing
Abbreviation Replacement - It is widely known that people on social platforms often use abbreviated forms in comments; for example, 'you're' is the contraction of 'you are'. To make our model more accurate, a dictionary is needed to substitute such abbreviations in the English dataset. Word Stemming - Word stemming removes affixes to obtain the root word; for instance, it maps 'loving', 'loves', and 'loved' to the common root 'love'. Mapping related words to the same stem generally gives satisfactory results on the English dataset. Other Normalization Approaches - We lowercase the text for uncased BERT and delete stop words, which are meaningless, from the English dataset. Besides, a TF-IDF vectorizer is adopted to transform the original text into a feature matrix.
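A minimal sketch of these pre-processing steps in Python. The contraction dictionary, suffix list, and stop-word set below are tiny illustrative stand-ins rather than the actual resources used in our system, and the TF-IDF function is a toy unsmoothed variant, not a production vectorizer.

```python
import math
import re

# Toy resources (illustrative stand-ins, not our full dictionaries).
CONTRACTIONS = {"you're": "you are", "can't": "cannot", "won't": "will not"}
STOP_WORDS = {"the", "a", "an", "of", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Crude suffix-stripping stemmer (a stand-in for a real stemmer)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: len(word) - len(suf)]
    return word

def preprocess(text):
    text = text.lower()                       # case conversion for uncased BERT
    for short, full in CONTRACTIONS.items():  # abbreviation replacement
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)      # tokenize, dropping punctuation
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def tfidf(docs):
    """Toy TF-IDF feature matrix: tf = raw count, idf = log(N / df)."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(t in d for d in docs) for t in vocab}
    return [[d.count(t) * math.log(n / df[t]) for t in vocab] for d in docs]

docs = [preprocess("You're loving the loved ones"),
        preprocess("The ones who can't love")]
matrix = tfidf(docs)  # one weight vector per document
# preprocess("You're loving the loved ones") → ['you', 'are', 'lov', 'lov', 'one']
```

Note that the crude stemmer above maps 'loving' and 'loved' to the same stem but leaves some related forms apart; a real system would use an established stemming algorithm instead.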

Methodology
Our model is based on BERT and the last-5-[CLS] token to tackle the issue of counterfactual detection. Before the training process, a TF-IDF vectorizer is adopted to weight each word by its importance. We divide the training data into 5 folds and use cross-validation to improve the training process. To avoid overfitting, as well as to accelerate training and improve generalization, we apply multi-sample dropout in our model. The details are as follows.
• Light Gradient Boosting Machine (LightGBM) (Ke et al., 2017) is a framework that implements the gradient boosting decision tree (GBDT) algorithm and supports efficient parallel training. It has several advantages: faster training speed, lower memory consumption, and distributed support (i.e., it can quickly process massive data).
• TF-IDF measures the importance of a word to a document and is often used as a weight vector. The TF-IDF value increases with the number of times a word appears in the document and is offset by the number of documents that contain the word, which helps adjust for the fact that some words appear more frequently in general. This measurement is widely used in NLP models, and it also performs well in counterfactual detection.
• BERT is a powerful model proposed by the Google research team (Devlin et al., 2018) and has two steps: pre-training and fine-tuning. During the pre-training stage, BERT is trained on unlabeled text over several pre-training tasks. For fine-tuning, all of the parameters initialized during pre-training are then fine-tuned on the counterfactual detection task. We conducted experiments with different models, and BERT performed better than the others on detecting counterfactuals. We compare the F1 scores of BERT and non-BERT models in Table 1.
• Dropout is a commonly used regularization method in deep neural networks (DNNs). During training, dropout randomly ignores some neurons to avoid overfitting. The proposed method adopts multi-sample dropout (Inoue, 2019), which both accelerates the training of the neural network and improves generalization over traditional dropout. Multi-sample dropout can be easily implemented in our language model; after adding it to BERT, the stability of the model improved considerably.
• [CLS] is a special token placed at the beginning of each sentence; its hidden state can represent the whole sentence. Last-5-[CLS], which uses the [CLS] representations of the last five layers, has performed well in Internet news sentiment analysis based on A Robustly Optimized BERT Pretraining Approach (RoBERTa) (Liu et al., 2019). We use this representation to better represent each sentence and thereby improve the accuracy of counterfactual identification.
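The last two ideas can be sketched together in plain Python, with nested lists standing in for real tensors. The hidden states, layer count, and classifier weights below are made-up illustrative values, not our trained model; whether the five [CLS] vectors are averaged or concatenated is an implementation choice, and this sketch averages them.

```python
import random

def last5_cls(hidden_states):
    """Average the [CLS] (first-token) vector over the last five layers."""
    last5 = hidden_states[-5:]
    dim = len(last5[0][0])
    return [sum(layer[0][d] for layer in last5) / len(last5) for d in range(dim)]

def multi_sample_dropout_logit(features, weights, p=0.5, num_samples=4, seed=0):
    """Multi-sample dropout (Inoue, 2019): apply several independent dropout
    masks to the same features and average the resulting logits."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Inverted dropout: zero each feature with probability p, scale survivors.
        dropped = [x / (1 - p) if rng.random() >= p else 0.0 for x in features]
        total += sum(w * x for w, x in zip(weights, dropped))
    return total / num_samples

# Made-up "encoder output": 6 layers x 3 tokens x 4 hidden dimensions.
hidden_states = [[[float(l + t + d) for d in range(4)] for t in range(3)]
                 for l in range(6)]
cls = last5_cls(hidden_states)                        # sentence representation
logit = multi_sample_dropout_logit(cls, weights=[0.1, -0.2, 0.3, 0.05])
```

Because each dropout sample shares the same forward pass up to the classification head, the extra samples add little computation while averaging several stochastic losses per step, which is what speeds up convergence.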

Results
Our best results for Task 5 are summarized in Table 2.

Conclusion
To detect counterfactuals online effectively, we have applied the last-5-[CLS] token and BERT to deal with the multiple types of news in Task 5. To reduce computation cost, we have also applied TF-IDF to measure the importance of each word. During training, we used multi-sample dropout to avoid overfitting and accelerate training; compared with standard dropout, it improves robustness. Overall, our work shows competitive results compared with the other teams. The combination of the TF-IDF vectorizer and the last-5-[CLS] token is also a creative way to improve accuracy by extracting more detailed information.