CNRL at SemEval-2020 Task 5: Modelling Causal Reasoning in Language with Multi-Head Self-Attention Weights Based Counterfactual Detection

In this paper, we describe an approach for modelling causal reasoning in natural language by detecting counterfactuals in text using multi-head self-attention weights. We use pre-trained transformer models to extract contextual embeddings and self-attention weights from the text. We show the use of convolutional layers to extract task-specific features from these self-attention weights. Further, we describe a fine-tuning approach with a common base model for knowledge sharing between the two closely related sub-tasks for counterfactual detection. We analyze and compare the performance of various transformer models in our experiments. Finally, we perform a qualitative analysis with the multi-head self-attention weights to interpret our models’ dynamics.


Introduction
Causal reasoning is a process of detecting cause-effect relationships and is increasingly being used in artificial intelligence for improving generalization and interpretability. Modelling causal reasoning in language involves detecting such cause-effect relationships from natural language texts. A cause-effect relationship can be modelled as: Event A causes Event B. Counterfactuals describe events counter to facts and hence naturally involve common sense and causal reasoning. A counterfactual can be modelled as a cause-effect relationship of the form: Event A could have caused Event B (Event A did not occur).
SemEval-2020 Task 5 (Yang et al., 2020) consists of two independent sub-tasks: a binary-classification task for detecting counterfactual statements and a span-detection task for detecting the antecedent (cause) and consequent (effect) spans of given counterfactual statements (Table 1). In this work, we use multi-head self-attention weights from pre-trained transformer models (Vaswani et al., 2017) to capture the semantic interactions between the tokens of the given text with respect to causal relations. We use a fine-tuning approach with a common base model for knowledge sharing between these two closely related sub-tasks. The code for this work is made publicly available as a GitHub repository.

Background
Early work on causal reasoning and related tasks in natural language was based on various statistical and linguistic approaches (Asghar, 2016). Recent work on causal reasoning related tasks involves deep learning based approaches. Causal reasoning can be achieved through extraction of cause-effect relations with CRF and LSTM based sequence labelling tasks (Dasgupta et al., 2018). Counterfactuals can contain implicit causal relations (Table 1). Using multi-head self-attention at word level can help capture such implicit causal relations effectively (Liang et al., 2019). Current benchmarks for modelling causal reasoning involve question-answering tasks (Gordon et al., 2012). Pre-trained transformer models have been effective on such tasks (Sap et al., 2019). Self-attention weights from such transformer models are usually structured in a 3-dimensional matrix. Using convolutional neural networks with these self-attention weight matrices can be helpful for extracting semantic features for downstream NLP tasks (Fang et al., 2019). Similar approaches can be used to extract features for detecting counterfactuals in natural language texts. Further, the multi-head self-attention weights from these models can also be used in interpretative qualitative analysis (Voita et al., 2019).

Counterfactual Statement | Antecedent | Consequent
"If I had 10 pharmacists who worked with me, I could reach 100 people more effectively." | If I had 10 pharmacists who worked with me | I could reach 100 people more effectively
"Thanks for the article on this new term that fits me so well, wish all your articles were worthy of praise." | wish all your articles were worthy of praise | -
Table 1: Example counterfactual statements from the task dataset

Base Architecture
We use the knowledge learned during the binary-classification counterfactual detection sub-task for the antecedent-consequent span-detection sub-task by defining a base architecture, common to both the sub-tasks (Figure 1). The base architecture is used to extract task-specific features, which are further passed on to task-specific modules. We first train the base architecture with a binary-classification module for the first sub-task. Then, we replace the binary-classification module with a regression module and fine-tune the already trained base architecture for the more complex second sub-task.
For the base architecture, we use pre-trained transformer models to extract contextual output embeddings and multi-head self-attention weights from the tokenized input text. The output embeddings are passed through a pooling layer to get a pooled embedding. The multi-head self-attention weights are structured in a 3-dimensional matrix with the following dimensions: (number of attention heads, number of tokens in the text, number of tokens in the text). This matrix is passed through convolutional and linear blocks to get an attention embedding. The pooled embedding and the attention embedding are concatenated together to form a combined feature embedding. We apply a layer normalization operation on the combined feature embedding for better generalization and stability for knowledge sharing across the sub-tasks. The feature embedding is then passed through a linear block and fed to the task-specific module. A linear block is composed of a fully connected layer with ReLU activation and dropout regularization, and a convolutional block is composed of a 2D-convolution layer with batch normalization, ReLU activation and dropout regularization.
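The data flow described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the released implementation: the layer sizes, the 3x3 kernel, the mean pooling of the output embeddings, and the adaptive average pooling used to make the attention features independent of text length are all assumptions, and the class names are hypothetical.

```python
import torch
import torch.nn as nn


class LinearBlock(nn.Module):
    """Fully connected layer -> ReLU -> dropout."""

    def __init__(self, in_dim, out_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):
        return self.net(x)


class ConvBlock(nn.Module):
    """2D convolution -> batch normalization -> ReLU -> dropout."""

    def __init__(self, in_ch, out_ch, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # kernel size is an assumption
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class BaseArchitecture(nn.Module):
    """Combines a pooled embedding with an attention embedding (hypothetical sizes)."""

    def __init__(self, hidden_dim, num_heads, attn_dim=64, feat_dim=128):
        super().__init__()
        self.conv = ConvBlock(num_heads, 8)
        # Assumption: squeeze the (tokens x tokens) map to a fixed 4x4 grid
        # so that texts of any length give the same feature size.
        self.squeeze = nn.AdaptiveAvgPool2d((4, 4))
        self.attn_linear = LinearBlock(8 * 4 * 4, attn_dim)
        self.norm = nn.LayerNorm(hidden_dim + attn_dim)
        self.out = LinearBlock(hidden_dim + attn_dim, feat_dim)

    def forward(self, embeddings, attentions):
        # embeddings: (batch, tokens, hidden_dim)
        # attentions: (batch, heads, tokens, tokens)
        pooled = embeddings.mean(dim=1)  # mean pooling is an assumption
        attn_emb = self.attn_linear(self.squeeze(self.conv(attentions)).flatten(1))
        combined = torch.cat([pooled, attn_emb], dim=-1)
        return self.out(self.norm(combined))
```

Calling `BaseArchitecture(hidden_dim=32, num_heads=4)` on an embedding tensor of shape (2, 10, 32) and an attention tensor of shape (2, 4, 10, 10) yields a (2, 128) feature embedding that a task-specific module can consume.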
Here, we experiment with various pre-trained transformer models which differ from each other in terms of pre-training approach, architecture and number of parameters: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) (robust pre-training), XLNet (Yang et al., 2019) (autoregressive model) and DistilBERT (Sanh et al., 2019) (distilled model). Usually, the last layer of any transformer model gets quickly biased for any individually trained task. Here, we concatenate the output embeddings and multi-head self-attention weights from the last three layers of the transformer model so that more generalized features are learned by the common base architecture while training for each of the sub-tasks separately.
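One way to combine the last three layers is sketched below. The tensors are random stand-ins shaped like the per-layer hidden states and attention weights that a transformer returns; the choice of concatenation axes (the hidden dimension for embeddings, the head dimension for attention weights) is an assumption, since the text does not specify it, and `concat_last_layers` is a hypothetical helper.

```python
import torch


def concat_last_layers(hidden_states, attentions, k=3):
    """Concatenate the last k layers of embeddings and attention weights.

    hidden_states: sequence of (batch, tokens, hidden) tensors, one per layer
    attentions:    sequence of (batch, heads, tokens, tokens) tensors, one per layer
    """
    # Assumption: embeddings are joined along the hidden dimension.
    emb = torch.cat(list(hidden_states[-k:]), dim=-1)  # (batch, tokens, k * hidden)
    # Assumption: attention maps are stacked as extra "heads" (channels).
    att = torch.cat(list(attentions[-k:]), dim=1)      # (batch, k * heads, tokens, tokens)
    return emb, att


# Random stand-ins for a 6-layer model with 4 heads and hidden size 32.
batch, tokens, hidden, heads, layers = 2, 10, 32, 4, 6
hidden_states = tuple(torch.randn(batch, tokens, hidden) for _ in range(layers + 1))
attentions = tuple(
    torch.softmax(torch.randn(batch, heads, tokens, tokens), dim=-1) for _ in range(layers)
)
emb, att = concat_last_layers(hidden_states, attentions)
```

With this layout, the stacked attention tensor can be fed directly to a 2D convolution whose input channel count is three times the number of heads.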

Task Specific Modules
For the counterfactual detection sub-task, we have binary labels marking each statement as counterfactual or non-counterfactual. We use a linear block with sigmoid activation as a binary-classification module for this task. For the second sub-task, we have character-level span locations (start-id and end-id) for the antecedent and consequent spans of the given counterfactual statements. This can be treated as a regression problem with 4 target values. We use another linear block with ReLU activation and 4 output neurons as a regression module for this task. The lengths of various counterfactual statements in the second sub-task vary considerably across the dataset. This induces a certain variance in the character-level

Experiments
For all our experiments, we use Binary Cross Entropy loss (for counterfactual detection) and Smooth L1 loss (for antecedent-consequent span regression) with Adam optimizer (with weight decay) to train our models. We use the PyTorch implementations of the smallest base variants of pre-trained transformer models by Hugging Face in our base architecture. We use a 90-10 data split for training and development purposes respectively. The data distribution across the splits is shown in Table 2. We validate our models on F1 score (for counterfactual detection) and Smooth L1 loss (for antecedent-consequent detection). The evaluation metrics for the counterfactual detection task are the binary precision, recall and F1 scores. For the antecedent-consequent span detection task, the precision, recall and F1 score are defined as sequence labelling metrics with respect to the overlap between the predicted and the ground truth spans.
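The two task-specific modules and their losses can be sketched as below. This is an illustrative setup under stated assumptions: the feature size, learning rate and weight decay are hypothetical values, dropout is omitted from the heads for brevity, and only the heads are updated here on fixed random features, whereas the system described above also trains the base architecture.

```python
import torch
import torch.nn as nn

FEAT_DIM = 128  # hypothetical size of the feature embedding from the base architecture

# Sub-task 1: binary classification (counterfactual or not).
clf_head = nn.Sequential(nn.Linear(FEAT_DIM, 1), nn.Sigmoid())
# Sub-task 2: regression over 4 values, the start and end positions
# of the antecedent and the consequent.
reg_head = nn.Sequential(nn.Linear(FEAT_DIM, 4), nn.ReLU())

bce = nn.BCELoss()
smooth_l1 = nn.SmoothL1Loss()

# Adam with weight decay; both hyperparameter values are assumptions.
clf_opt = torch.optim.Adam(clf_head.parameters(), lr=1e-3, weight_decay=1e-2)
reg_opt = torch.optim.Adam(reg_head.parameters(), lr=1e-3, weight_decay=1e-2)


def train_step(head, loss_fn, optimizer, features, targets):
    """Run one optimization step and return the scalar loss."""
    optimizer.zero_grad()
    loss = loss_fn(head(features), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Swapping `clf_head` for `reg_head` on top of the same feature extractor mirrors the replacement of the binary-classification module by the regression module when moving from the first sub-task to the second.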

Discussion
For the final submission, we use RoBERTa and BERT as the transformer model in our base architecture for sub-task 1 and sub-task 2 respectively. On the final task leaderboard, our system ranks 13th (0.845 F1) for the counterfactual detection sub-task and 7th (0.688 F1) for the antecedent-consequent detection sub-task. Since we treat antecedent-consequent detection as a regression task, we do not monitor the Exact Match score between the predicted and ground truth spans, which is more significant for token-level sequence labelling based approaches. We analyse the performance of various transformer models with hyperparameter tuning post evaluation (Table 3). RoBERTa gives the best results for the counterfactual detection sub-task, whereas BERT gives the best results for the antecedent-consequent detection sub-task. DistilBERT, a considerably smaller model (65M parameters), shows performance comparable to the rest of the transformer models (110M+ parameters) for both the sub-tasks.
Since BERT performs marginally better than the rest of the transformer models for the antecedent-consequent detection sub-task, we consider BERT for the further qualitative analysis. We inspect the multi-head self-attention weights from the final layer of BERT (Vig, 2019) to interpret the model's dynamics. Overall, the model assigns more attention to certain parts of the text which are related to the conditional nature of the counterfactual statements. Moreover, we see that some of the attention heads learn to assign more attention to specific parts of the text. Heads 1 and 6 assign maximum attention to punctuation, and Heads 2, 4 and 12 focus more on auxiliary verbs. Heads 3 and 11 attend to conjunctions and verbs which act as causal connectives in the text. Head 5 and Head 7 attend to entities and numerical values respectively (if present).
This linguistically selective attention of the attention heads can be observed in the following examples of antecedent-consequent spans detected by our system (rounded off to include partially covered words). Here, an underline indicates the detected antecedent and the detected consequent is in bold.
4. I could^{2,4,11} have been you and you could^{3,12} have been me.
5. Of course the company wouldn't^{2,4,7,12} have^{3} had to sell such a prized asset if^{11} it had other options to raise^{5} capital.
The superscripts here represent the most attending attention head(s) for the corresponding word. The same can be confirmed by a visualization (Figure 2) of the head-wise color coded self-attention weights. For Examples 3 and 4, we have no consequent part in the text. Our system detects (0,0) as consequent span start and end locations for such counterfactual statements, indicating the absence of the consequent. The conjunctions (but / and) in such counterfactual statements are ignored by the attention heads. But the conjunctions (if) in counterfactual statements with a consequent part (Examples 1, 2 and 5) are highly attended by the attention heads from tokens across the entire sequence. This shows the ability of the model to differentiate the causal connectives from the non-causal ones in the text. Punctuation marks play an important role here as they are usually present near the boundaries of antecedent-consequent spans (Examples 2 and 3). Auxiliary verbs (would, wouldn't, could, have) are assigned maximum attention across all the examples as they directly correspond to the conditional nature of counterfactual statements.
Our proposed approach uses multi-head self-attention weights from transformer models to detect causal relations for counterfactual detection in text. Through our experiments, we find that RoBERTa overall shows the best performance for the counterfactual detection task and BERT performs the best for the antecedent-consequent detection task. We show that even smaller transformer models like DistilBERT perform counterfactual detection tasks effectively. With knowledge sharing between the two sub-tasks, our system detects antecedent-consequent spans in counterfactual statements with good efficiency by a simple regression over the spans. This can possibly be further improved by post-inference processing on the predicted spans or by replacing the regression module with a token-level sequence labelling module.
Further, we show that through our approach, the attention-heads attain a property of assigning linguistically selective-attention with respect to the conditional nature of the counterfactual statements.
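As an illustration of how regressed span locations might be turned into word-aligned spans, the sketch below rounds a predicted character span outward to whole words and treats (0,0) as an absent span, as discussed above. It is a hypothetical helper, not the system's actual post-processing, and it assumes an exclusive end index and whitespace-delimited words.

```python
def snap_to_words(text, start, end):
    """Expand a predicted character span so that it covers whole words.

    Assumptions: `end` is exclusive, words are separated by whitespace,
    and a prediction that rounds to (0, 0) means the span is absent.
    """
    start, end = int(round(start)), int(round(end))
    if start == 0 and end == 0:
        return None  # no span, e.g. a statement without a consequent
    # Clamp to the valid range and keep start <= end.
    start = max(0, min(start, len(text)))
    end = max(start, min(end, len(text)))
    # Move the start left to the beginning of a partially covered word.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    # Move the end right to the end of a partially covered word.
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


sentence = "If I had 10 pharmacists"
span = snap_to_words(sentence, 6, 15)
```

Here a prediction that starts inside "had" and ends inside "pharmacists" is widened to cover both words in full.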