HIT-SCIR at SemEval-2020 Task 5: Training Pre-trained Language Model with Pseudo-labeling Data for Counterfactuals Detection

We describe our system for Task 5 of SemEval 2020: Modelling Causal Reasoning in Language: Detecting Counterfactuals. Although deep learning has achieved significant success in many fields, it still falls short of strong AI because it lacks causation, a fundamental concept in human thinking and reasoning. In this task, we focus on detecting causation, especially counterfactuals, in texts. We explore multiple pre-trained models to learn basic features and then fine-tune them with counterfactual data and pseudo-labeled data. Our team HIT-SCIR won first place (1st) in Sub-task 1, Detecting Counterfactual Statements, and ranked 4th in Sub-task 2, Detecting Antecedent and Consequence. In this paper we provide a detailed description of our approach, as well as the results obtained in this task.


Introduction
Deep learning technologies have achieved truly remarkable results and have attracted much attention from researchers. However, deep learning systems' lack of understanding of causal relations is perhaps the biggest roadblock to giving them human-level intelligence (Pearl, 2019). Causation is a fundamental concept in human thinking and reasoning, denoting a special semantic relation between a cause and its effect (Stukker et al., 2008). Pearl and Mackenzie (2018) propose that there are three levels of causation, the top level being counterfactual analysis. Counterfactual statements describe events that did not actually happen or cannot happen, as well as the possible consequences had those events happened (Yang et al., 2020). SemEval 2020 Task 5 consists of two subtasks that aim to detect counterfactual descriptions in sentences; this paper presents solutions to both subtasks.
Subtask 1, Detecting Counterfactual Statements, requires determining whether a given statement is counterfactual across several domains. For example, given the sentence "Her post-traumatic stress could have been avoided if a combination of paroxetine and exposure therapy had been prescribed two months earlier.", the system must decide whether it is a counterfactual statement. This classification task is evaluated with three metrics: Precision, Recall and F1. The challenges of this subtask include: a) positive and negative samples are unevenly distributed in the training and test data; b) the patterns of counterfactual statements can hardly be defined by rules or learned by traditional word-embedding-based models.
Subtask 2, Detecting Antecedent and Consequence, aims to locate the antecedent and consequent in counterfactuals. A counterfactual statement can be converted to a contrapositive with a true antecedent and consequent (Goodman, 1947). Consider the "post-traumatic stress" example discussed above; it can be transposed into "because her post-traumatic stress was not avoided, (we know) a combination of paroxetine and exposure therapy was not prescribed". Such knowledge can not only be used for analyzing the specific statement but can also be accumulated across corpora to build domain causal knowledge (e.g., a combination of paroxetine and exposure therapy may help cure post-traumatic stress). The results are evaluated with four metrics: Precision, Recall, F1 and Exact Match. Unlike standard sequence labeling tasks, some counterfactual statements contain only an antecedent without a consequent. For instance, in "Thanks for the article on this new term that fits me so well, wish all your articles were worthy of praise.", only the antecedent "wish all your articles were worthy of praise" is present.
Recently, pre-trained language models have achieved great success on various NLP tasks (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019). Hence, we first explore stacking multiple pre-trained language models (i.e., BERT, XLNet and RoBERTa) fine-tuned on the counterfactual data. These models are trained in a supervised fashion with labeled data; however, the labeled data available in Task 5 is limited. We therefore adopt a simple way of training neural networks in a semi-supervised fashion: for unlabeled data, pseudo-labels, created by picking the class that receives more votes from the pre-trained language models, are used as if they were true labels.

Related Work
Subtask 1 is a text classification task. Traditional machine learning systems, such as Naive Bayes (Domingos and Pazzani, 1997) and Support Vector Machines (Cortes and Vapnik, 1995), perform well on this task. In recent years, text classification has made breakthroughs with deep learning. A number of studies (Kim, 2014; Liu et al., 2016; Lai et al., 2015) made a series of improvements to the structure of deep neural networks. BERT (Devlin et al., 2018), which produces dynamic contextual word representations from a language model, has achieved very competitive performance in multiple domains of text classification. In addition, pre-trained language models such as XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019) have also achieved strong performance on various NLP tasks.
Subtask 2 is a sequence labeling task. Traditional work uses statistical approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015). Deep learning sequence labeling models utilize word-level Long Short-Term Memory (LSTM) structures to represent global sequence information and a CRF layer to capture dependencies between neighboring labels (Lample et al., 2016; Ma and Hovy, 2016; Peters et al., 2017). In a similar manner, fine-tuning pre-trained models such as BERT for sequence labeling has achieved excellent performance.

Methodology
For the two subtasks (i.e., a text classification task and a sequence labeling task), we start by fine-tuning pre-trained language models, including BERT, RoBERTa and XLNet. Then, we enlarge the training set with pseudo-labeled sentences, which are predicted on the test set by a voting ensemble of several single pre-trained models. Finally, we use the ensemble model to classify counterfactual statements for subtask 1 and utilize a CRF model to perform sequence labeling for extracting antecedents and consequents in subtask 2. We introduce each module in detail in the following sections.

Fine-tuning Pre-Trained Models
BERT (Devlin et al., 2018) is a revolutionary self-supervised pre-training technique that learns to predict intentionally hidden (masked) sections of text. Crucially, the representations learned by BERT have been shown to generalize well to downstream tasks, and when BERT was first released in 2018 it achieved state-of-the-art results on many NLP benchmark datasets.
XLNet (Yang et al., 2019) is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet also achieves state-of-the-art results on various downstream language tasks.
RoBERTa (Liu et al., 2019) builds on BERT's language masking strategy and modifies key hyperparameters in BERT, including removing BERT's next-sentence pre-training objective and training with much larger mini-batches and learning rates. RoBERTa is trained on an order of magnitude more data than BERT, for a longer amount of time. This allows RoBERTa representations to generalize even better to downstream tasks compared to BERT.
For subtask 1, we use the default models for sentence classification implemented in the huggingface library. For subtask 2, which can be formulated as a sequence labeling task, sentences are fed into RoBERTa and then into a classification layer, producing a sequence of tag scores for each sentence. The sub-token entries are removed from the sentences and the remaining tokens are passed to the CRF layer. The maximum-context tokens are selected and concatenated to form the final predicted tags. We use the Viterbi algorithm to simplify CRF decoding and adopt the standard CRF loss function.
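As an illustration, the Viterbi pass used for CRF decoding can be sketched in plain Python over per-token emission scores and a tag-transition table (a minimal sketch for exposition; the dictionary-based representation here is our simplification, not the tensor implementation used in our system):

```python
def viterbi_decode(emissions, transitions, tags):
    """Find the highest-scoring tag sequence for one sentence.

    emissions:   list of dicts, emissions[t][tag] = score of `tag` at token t
    transitions: dict, transitions[(prev_tag, tag)] = transition score
    tags:        list of all tag names, e.g. ["O", "B-A", "I-A", "B-C", "I-C"]
    """
    # Initialize with the first token's emission scores.
    score = {tag: emissions[0][tag] for tag in tags}
    backptrs = []

    for t in range(1, len(emissions)):
        new_score, ptr = {}, {}
        for tag in tags:
            # Best previous tag leading into the current tag.
            prev = max(tags, key=lambda p: score[p] + transitions[(p, tag)])
            new_score[tag] = score[prev] + transitions[(prev, tag)] + emissions[t][tag]
            ptr[tag] = prev
        score = new_score
        backptrs.append(ptr)

    # Backtrack from the best final tag.
    best = max(tags, key=lambda tag: score[tag])
    path = [best]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The transition scores are what let the decoder penalize illegal tag bigrams (e.g., I-A directly after O) that a per-token classifier alone would happily produce.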

Pseudo-Labeling
Pseudo-labeling is a simple but effective semi-supervised learning method that can improve the performance of machine learning models by utilizing unlabeled data (Lee, 2013). In subtask 1, we first fine-tune different pre-trained language models on the training set as pseudo-label predictors, then use these classifiers to produce pseudo-labels for sentences in the test set. We add a pseudo-labeled sentence to the training set if and only if all classifiers agree on its label. In subtask 2, we use k roberta-large models trained by k-fold cross-validation as pseudo-label predictors to make predictions on the test set, and we add pseudo-labels to the training set when at least n (n ≤ k) models agree on them.
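The voting rule for both subtasks can be sketched as follows (the predictor callables are stand-ins for the fine-tuned models; `min_agree` defaults to unanimity, matching the subtask 1 setting, and can be relaxed to n of k for subtask 2):

```python
def pseudo_label(unlabeled, predictors, min_agree=None):
    """Return (sentence, label) pairs where enough predictors agree.

    unlabeled:  list of sentences (the unlabeled pool, e.g. the test set)
    predictors: list of callables, each mapping a sentence to a label
    min_agree:  number of predictors that must agree on the majority label;
                defaults to all of them (unanimous voting)
    """
    if min_agree is None:
        min_agree = len(predictors)
    accepted = []
    for sent in unlabeled:
        votes = [p(sent) for p in predictors]
        # Pick the majority label, then check whether it has enough votes.
        label = max(set(votes), key=votes.count)
        if votes.count(label) >= min_agree:
            accepted.append((sent, label))
    return accepted
```

The accepted pairs are simply appended to the labeled training set before the final fine-tuning round.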

Ensemble
Ensembling has been shown to effectively improve the robustness and accuracy of individual prediction models (Opitz and Maclin, 1999; Rokach, 2010). By ensembling predictions from models with different hyper-parameters or architectures, we can obtain better results than any individual model. In our system, XLNet and RoBERTa have different training objectives, while BERT and RoBERTa have different hyper-parameters. Thus, we combine their probability predictions via a weighted average ensemble and then optimize the classification threshold for the best F1 score on the combined predictions.
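The weighted average and the threshold search can be sketched as follows (a pure-Python illustration; the probabilities and weights are placeholders, not our actual model outputs):

```python
def ensemble_probs(prob_lists, weights):
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
            for i in range(len(prob_lists[0]))]

def f1_at_threshold(probs, labels, threshold):
    """Binary F1 when predicting positive iff prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, labels):
    """Scan candidate thresholds and keep the one maximizing F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))
```

In our system the search is run over the cross-validation predictions, which is how the final threshold of 0.3437 was selected.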

Sequence Labeling with CRF
In subtask 2, we use a CRF model to perform the sequence labeling task and extract antecedents and consequents from counterfactual statements. With an additional CRF output layer, our model computes not only emission scores but also transition scores. The transition scores from the CRF help guarantee the integrity and correctness of the output label sequence. We label antecedents and consequents in counterfactual statements with BIO tags.
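Given a predicted BIO sequence, the antecedent and consequent spans can be read off directly; a minimal decoder might look like this (the tag names B-A/I-A for antecedent and B-C/I-C for consequent are an illustrative choice, not necessarily our exact label set):

```python
def extract_spans(tags):
    """Collect (label, start, end) spans from a BIO tag sequence.

    `end` is inclusive; tags look like "B-A", "I-A", "B-C", "I-C", "O".
    """
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        # Close the open span on "O", on a new "B-", or on a mismatched "I-".
        if tag.startswith("B-") or tag == "O" or (label and tag != "I-" + label):
            if start is not None:
                spans.append((label, start, i - 1))
                start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans
```

A statement with no consequent simply yields no C-labeled span, which matches the antecedent-only cases described in the task.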

Evaluation Metrics
• Subtask 1: Precision, Recall, and F1. The evaluation verifies whether the predicted binary "label" matches the desired "label" annotated by human workers, and then calculates precision, recall, and F1 scores.
• Subtask 2: Exact Match, Precision, Recall, and F1. Exact Match represents the percentage of predictions in which both the antecedent and the consequent exactly match the human-annotated outcome. The F1 score is a token-level metric calculated from the submitted antecedent startid, antecedent endid, consequent startid and consequent endid.
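As an illustration of the token-level metric, precision, recall and F1 can be computed from the overlap between predicted and gold index sets (this is our own reading of the metric for exposition, not the official scorer; -1 marks an absent span):

```python
def span_tokens(start, end):
    """Token indices covered by an inclusive span; (-1, -1) means no span."""
    return set() if start == -1 else set(range(start, end + 1))

def token_f1(pred, gold):
    """pred/gold: (ant_start, ant_end, con_start, con_end) index tuples."""
    p = span_tokens(pred[0], pred[1]) | span_tokens(pred[2], pred[3])
    g = span_tokens(gold[0], gold[1]) | span_tokens(gold[2], gold[3])
    overlap = len(p & g)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return precision, recall, 2 * precision * recall / (precision + recall)
```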

Experimental Details
Experimental Environment. Our experimental code relied heavily on the BERT, RoBERTa and XLNet models implemented in the Huggingface Transformers package (2.5.1) with PyTorch (1.2.0). Computations were performed on Tesla P100-PCIE-16GB GPUs.
Hyper-Parameters of Pseudo-Labeling. For pseudo-labeling in subtask 1, we train a bert-large-cased model, an xlnet-large-cased model and a roberta-large model on the whole training set as pseudo-label predictors. We use an AdamW optimizer with learning rate = 2e-5, epsilon = 1e-8, max seq length = 128, batch size = 16, seed = 42, and train each model for 4 epochs. We then use these pseudo-label predictors to predict on the test set and obtain 6,867 pseudo-labels on which all predictors agree. For subtask 2, we use five different roberta-large models trained in 5-fold cross-validation as pseudo-label predictors to make predictions on the test set, and we add a pseudo-label to the training set if and only if at least N models agree on it. We again use an AdamW optimizer with learning rate = 1e-5, epsilon = 1e-8, max seq length = 250, batch size = 8, seed = 19970514, and train each model for 5 epochs. This yields 822 pseudo-labels when N = 4 and 524 pseudo-labels when N = 5.
Hyper-Parameters of Our System. We first add all pseudo-labeled sentences to the training set, then train a bert-large-cased model, an xlnet-large-cased model and a roberta-large model on it, respectively. After that, we use a weighted average ensemble to combine the predicted probabilities of BERT (weight = 1), XLNet (weight = 1) and RoBERTa (weight = 2). Due to the imbalanced distribution of the training set for subtask 1, models trained on it tend to predict more negative labels with a threshold of 0.5. Hence, we used a lower threshold (0.3437), which gives the best average F1 in 10-fold cross-validation on the training set, for the final submission. The hyper-parameters of our system used in both subtasks are shown in Table 2.

                      BERT (st1)   XLNet (st1)   RoBERTa (st1)   RoBERTa (st2)
batch size            12           12            8               8
learning rate         2e-5         2e-5          2e-5            5e-6
max epochs            3            3             3               3
max sequence length   128          128           128             250
random seed           42           42            42              19970514

Table 2: Hyper-Parameters of our system for subtask 1 and subtask 2

Results
As there is no validation set in Task 5, we use cross-validation on the training set for validation. We tune our system on the validation set and report the experimental results in Table 3 and Table 4. We then choose the system that achieved the best performance on the validation set, evaluate it on the test set, and report those results in Table 5 and Table 6. Table 3 shows the results of 10-fold cross-validation on the training set of subtask 1. Models trained with extra pseudo-labels achieve higher F1 scores than models trained only on the original training data, and the weighted average ensemble of the single models trained with extra pseudo-labels outperforms all single models, achieving the highest F1 score. Table 4 shows the results of 5-fold cross-validation on the training set of subtask 2 (in Table 4, 4commonPL means that we add a pseudo-labeled sample of the test set to the training set when 4 models agree on it). The model trained with extra pseudo-labels from a single model achieves higher Recall, Precision and Exact Match than the model trained on the original training data alone.
Table 5 shows the results on the test set of subtask 1, which are similar to those on the validation set: models trained with pseudo-labels again achieve better F1 scores than models trained only on the original training data, and the ensemble model (#8) trained with pseudo-labels achieves the best result. Table 6 shows that the models trained with only the original training data perform better than models trained with pseudo-labels; pseudo-labels on the test set do not provide obvious help here. A possible explanation is that the pseudo-labels do not contain enough high-confidence labeled data to provide any benefit. In addition, to ensure the correctness of the label format, we applied some rules on top of the optimal model. For example, if the start of the antecedent is labeled as -1 by the best model, we take a reasonable result from the outputs of the other models.

Conclusion
Counterfactual inference is a challenging problem in causation that is crucial for today's AI systems. In this paper, we designed and implemented an ensemble model for detecting counterfactuals. It achieved the top F1 score of 90.90% on the Subtask 1 evaluation leaderboard and the 4th-best F1 score of 84.10% on the Subtask 2 evaluation leaderboard. Task 5 mainly explores how to detect counterfactuals, which is the basis of counterfactual inference. We believe that causation, and counterfactuals in particular, will attract more attention from researchers driven by this task.