Yseop at SemEval-2020 Task 5: Cascaded BERT Language Model for Counterfactual Statement Analysis

In this paper, we explore strategies to detect and evaluate counterfactual sentences. We describe our system for SemEval-2020 Task 5: Modeling Causal Reasoning in Language: Detecting Counterfactuals. We use a BERT base model for the classification task and build a hybrid BERT Multi-Layer Perceptron system to handle the sequence identification task. Our experiments show that, while introducing syntactic and semantic features does little to improve the system on the classification task, using these types of features as cascaded linear inputs to fine-tune the sequence-delimiting ability of the model ensures it outperforms other similar-purpose complex systems, such as BiLSTM-CRF, on the second task. Our system achieves an F1 score of 85.00% in Task 1 and 83.90% in Task 2.


Introduction
A counterfactual can be defined as something that is contrary to the truth or that did not actually occur. It refers to an event that did not or cannot happen, as well as the possible consequences if it had happened. In the sentence "If dogs had no ears, they could not hear", the statement "if dogs had no ears" is an example of a counterfactual because dogs do have ears. Task 5 of SemEval-2020 (Yang et al., 2020) focuses on identifying these specific sentence types amongst sentences delivering close semantic similarities. This implies understanding and disambiguating the causal link between two sentence fragments.
We approached this task as an opportunity to test the effectiveness of disambiguating at a grammatical level against traditional baseline systems. This paper describes a parallel approach that derives meaning from text to leverage the influence of context and the relevance of structure in recognizing counterfactual statements. Specifically, we explore how many expressions of such statements a mapping of grammatical types can cover before falling short of high-performing models, most notably BERT (Devlin et al., 2018), which performed well on both tasks with an F1 score of 85.00% on Task 1 and 83.90% on Task 2.

Related Work
Although the task of detecting counterfactuals is relatively new, (Son et al., 2017) propose using modal logic to form rule-based determination methods from social media posts. These methods are supplemented by a statistical classifier (a linear SVM) that is retrained to tackle more challenging counterfactual forms.
Previous work on causal identification by (Levin and Hovav, 1994) studied the contribution of verbs in the determination of causal relations. By closely analyzing different formulations, they concluded that similarities in meaning can be derived from very different syntactic structures. As for causality relation extraction, different deep learning systems built on the success of neural-based models have been proposed. The linguistically informed CNN model (Dasgupta et al., 2018) leverages word embeddings and other linguistic features to detect causal patterns and outperforms rule-based classification. (Liang et al., 2019) introduces a multi-level causal detector that makes use of multi-head self-attention to capture semantic features at word level and infer causality at segment level. This engineered system has rivaled state-of-the-art models in terms of performance and thorough understanding of complex semantic information such as discourse relations and transitivity rules. Finally, (Li et al., 2019) presents a self-attentive BiLSTM-CRF model that makes use of transfer learning to overcome the problem of data insufficiency and extract causal relations in natural language text. This solution transfers a trained embedding from a large corpus and uses a causality tagging scheme to identify dependencies between cause and effect. Experimental results prove the effectiveness of this model, but its major limitation is the insufficiency of high-quality annotated data to learn from.

Dataset
The datasets used are those provided by the shared task organisers. The data is described in (Yang et al., 2020). As per official task instructions, no additional data was used.

Task 1: Classification Problem
Task 1 is a classification problem which aims at recognising text sections as either counterfactual or not. Counterfactual sections are labeled 1 and non-counterfactual sections are labeled 0.
The baseline proposed by the task organisers is a Support Vector Machine (SVM) classifier. After a preliminary phase exploring grammatical feature engineering, we discuss two different approaches and compare them to the baseline: a classical approach using popular machine learning classifiers with semantic and additional grammatical features, a combination that has been shown to perform well on information retrieval tasks relying on text understanding (Dai and Callan, 2019), and a deep learning approach using a BERT linear classifier. The objective of this comparison is to determine whether a supplement of linguistic features can be sufficient to correctly recognise counterfactual structures, as opposed to running heavier models that integrate broader contextual knowledge.

Exploration
As the dataset contains sentences with close semantics, we run a first exploratory analysis to try to establish disambiguation patterns. A manual linguistic analysis of the training dataset shows that verbs are key elements for the detection of counterfactuals. (Son et al., 2017) identify 7 characteristics related to counterfactuals, all depending on a verb feature. Verb tenses in particular are key to disambiguation, so we build a generalist grammar based on our observations that includes only the verbal forms that seem relevant for our analysis. We categorize could, would and should as modals so that they are not identified as verbs in the preterit tense. Finally, we add a pattern to disambiguate the categories of wish (verb) and wishes (noun), and to identify conditional statements (if). We eventually retain 4 main features among the 7 described by (Son et al., 2017): Verb inversion, Modal chunks, Wish verbs and If clauses. We then generate combinations of tokens based on their grammatical category (e.g., verb, pronoun) as well as their linguistic properties (e.g., tense). The full grammar is provided in Appendix A.
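As an illustration, the four retained feature categories can be approximated with simple regular expressions. This is only a rough sketch: the actual grammar in Appendix A is far richer, and the patterns below are hypothetical simplifications.

```python
import re

# Hypothetical, simplified regex versions of the four retained feature
# categories; the real grammar (Appendix A) covers many more forms.
PATTERNS = {
    "if_clause": re.compile(r"\bif\b", re.I),                    # conditional statements
    "wish_verb": re.compile(                                     # "wish" used as a verb
        r"\b(?:wish|wished|wishes)\s+(?:i|you|he|she|it|we|they|that)\b", re.I),
    "modal_chunk": re.compile(r"\b(?:could|would|should)\s+have\b", re.I),
    "verb_inversion": re.compile(r"^(?:had|were)\s+(?:i|you|he|she|it|we|they)\b", re.I),
}

def grammar_flags(sentence: str) -> dict:
    """Return, for each feature category, whether it fires on the sentence."""
    return {name: bool(p.search(sentence)) for name, p in PATTERNS.items()}

grammar_flags("If dogs had no ears, they could not hear")
# fires if_clause only: "could not hear" is not a could/would/should have chunk
```

This kind of pattern matcher directly exhibits the ambiguities discussed next: the modal chunk pattern fires on many non-counterfactual sentences.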
However, some cases resist disambiguation, especially sentences containing could/would/should have structures, which are very common in English and in many cases do not imply a counterfactual. Moreover, the context of these structures can vary, which makes them even more complex to disambiguate. Finally, the ambiguity between adjectives and past participles can cause many errors in the identification of counterfactuals when using grammars: adjectives constructed with the suffix -ed are also recognised by the grammar as verbs in the preterit tense or as past participles (e.g., He wasn't prepared).
By deterministically applying the grammar to the training set, we obtain 2833 sentences recognised as counterfactuals (out of a total of 13000). However, only 593 of them are labeled "1" in the training set. A manual analysis of the output reveals that many statements are recognised as counterfactuals because of remaining ambiguities.
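The counts above translate directly into the grammar's precision on the training set:

```python
# Precision of the deterministic grammar, from the counts reported above:
# 2833 sentences flagged as counterfactual, of which only 593 carry label "1".
flagged = 2833
true_positives = 593
precision = true_positives / flagged
print(f"grammar precision: {precision:.1%}")  # prints "grammar precision: 20.9%"
```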
We turn to classical linear and non-linear machine learning methods to explore whether combining these patterns carries any prior knowledge for disambiguation. We turn all 4 features described in Appendix A into binary variables, flagging the presence (True) or absence (False) of the category and its associated patterns in the sentence. An example of the data transformation is provided in Appendix B. We end up with a table of 31 features that are evaluated for predicting a counterfactual statement with the following learning methods and parameters (using scikit-learn classifiers 1):
• SVM: gamma: scale
• LOGIT: l1 ratio: 0.5, penalty: elasticnet, solver:
The results provided in Table 1, especially the very poor recall metrics, show that our initial set of features is incomplete and noisy. We then decide on a more holistic approach, summarising prior knowledge of the syntax to evaluate whether latent variables may lie in the raw morpho-syntax (POS tags) and semantics of counterfactual statements.

Operating methods
We first try a classical approach, supplementing classifiers with word vectorisation (Bag Of Words (BOW), TF-IDF, Word2Vec and BERT vectors) and morpho-syntactic features (POS tagging). The details of these transformations are provided in Appendix C. The classifiers evaluated are: SVM, K-Nearest-Neighbor (KNN), Multinomial Naive Bayes (NB), Decision Trees (CART), Random Forest (RF) and Multi-Layer Perceptron (MLP). Finally, we add a cross-validation method with a stratified fold of 3 to our models. As shown in Table 2, the best learning models appear to have strong linearisation capabilities (MLP and SVM), which is why we also test a deep learning model that exhibits similar behaviour.
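The stratified 3-fold split can be illustrated with a minimal stand-alone sketch (our experiments rely on scikit-learn's implementation; this version only shows the idea of preserving the label ratio in each fold):

```python
from collections import defaultdict

def stratified_folds(labels, k=3):
    """Deal the indices of each class round-robin across k folds so that
    every fold keeps roughly the overall label ratio."""
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

labels = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # 25% positive class
folds = stratified_folds(labels, k=3)
# each of the 3 folds holds 4 indices, exactly one of them positive
```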
For the deep learning approach, we use a model inspired by the BERT model that achieved the best result in a similar text classification task in the NLP4IF-2019 Shared Task (Da San Martino et al., 2019). This is a BERT model with a linear layer on top. This system has been shown to pay special attention to adjectives and verbs, two grammatical categories that can play a role in identifying counterfactual statements (Levin and Hovav, 1994). For our implementation, we use the BertForSequenceClassification model from the Transformers 2 library, a sentence-tokenized version of the BERT Uncased model with 12 Transformer layers and 110 million parameters.

Results
The results are based on the official training set provided by the organisers. The dataset contains 13000 lines split as follows: 40% training sample, 30% validation sample and 30% test sample.
For the classical approach, we retain the best result for each classifier out of all the possible combinations of classifiers and text processing, and compare with the baseline provided for the task and the deep learning model. The +/- scores are the averaged measures from the 3-fold cross-validation results for all models except the BERT Linear Model, whose results are provided with the scikit-learn default parameters. The blind test set consists of 7000 unlabeled lines of text. Our best model, the BERT Linear, achieves an F1 score of 85.00%, a Precision score of 84.20% and a Recall score of 85.90% on this set.

Task 2: Sequence Delimitation
Task 2 is a sequence delimitation problem. The text sections are similar to the ones labeled "1" in the Task 1 dataset. The purpose of this task is to extract, from a text section identified as counterfactual, the sub-strings identifying the antecedent and the consequent elements (Yang et al., 2020). We use the following split of the training data (3551 sentences): 1740 sentences for the training sample, 746 for the validation sample and 1065 for the test sample.

Proposed Method
Our approach for this task is also comparative and evaluates two deep learning models. It consists of testing whether we can supply enough linguistic knowledge to determine all causal forms in counterfactual statements to challenge the breadth and depth of a model that leverages the power of BERT.
For both systems, we represent each statement in the training set as a sequence of chunks. These chunks are composed of tokens labelled C when the token belongs to the Consequent sub-string and A when it belongs to the Antecedent sub-string. Tokens belonging to neither are marked I. These transformations of the target sequences, additional transformations and their input levels are described in Appendix D (D.0.1 and D.0.2). These target features are identified as CHUNKS in our results Tables 3 and 4. Since this task can be tackled as token classification, we build a first Sequence Extractor system using the BERT model from Task 1 and a wrapper 3 to concatenate a Multi-Layer Perceptron classifier layer on top of it, as demonstrated in (Dai and Callan, 2019).
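The chunk labelling scheme can be sketched as follows; for illustration, the antecedent and consequent are given as token index ranges, a simplification of the character offsets in the task data:

```python
def chunk_labels(tokens, antecedent, consequent):
    """Label each token A (antecedent), C (consequent) or I (neither),
    mirroring the CHUNKS targets described above."""
    labels = ["I"] * len(tokens)
    for i in range(*antecedent):
        labels[i] = "A"
    for i in range(*consequent):
        labels[i] = "C"
    return labels

tokens = "If dogs had no ears , they could not hear".split()
labels = chunk_labels(tokens, antecedent=(0, 5), consequent=(6, 10))
# -> ['A', 'A', 'A', 'A', 'A', 'I', 'C', 'C', 'C', 'C']
```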
The second system is inspired by discriminative models and Conditional Random Fields (CRF) in particular. We design the system as a Named Entity tagger that takes sentence tokens and a set of morpho-syntactic token-level features as input and predicts the target class of each token.
We enhance the discriminative properties of the CRF by working on some additional layers and modeling a Deep Learning CRF. We experiment with linguistic embeddings, features and regularisation methods as enhancements for the final BiLSTM-CRF Neural Model. A high-level diagram of the system architecture is presented in Appendix E.
The full configurations for both systems are detailed in Appendix D (D.0.3 and D.0.4).

Results
Since we are working with a complex model with high tuning capabilities, the most relevant results for the BiLSTM-CRF system on the training dataset are retained and presented in Table 4. The BERT-MLP model results are shown in Table 5. Here again, the +/- measures are averaged from the 3-fold cross-validation scores.
The results show that generalization might be a problem for both systems, as discussed in (Zeyer et al., 2019). The gap in performance between the BiLSTM-CRF and the BERT-MLP is due to the superior contextual representation capability of the pre-trained BERT model.

Conclusion
In this paper, we presented our experiments for identifying counterfactual statements. Modeling features derived from a linguistic analysis, such as specific grammar structures for counterfactual statements, and coupling them with established machine learning or deep learning models did not perform as well as context-learning models: our hybrid BERT-MLP solution outperforms even complex combinations of deep learners and displays a better level of understanding and handling of challenging counterfactual forms. Future work could explore the impact of graph knowledge in accommodating systems and rendering them more perceptive of implicit and ambiguous textual meanings.

B. Data Transformation Example

We transform the different representations of the 4 grammatical features listed in Appendix A into categorical variables. Table 6 illustrates an example of one feature representation applied to input sentences.

C. Input Text Processing

We perform a two-level processing on the input: vectorial and morpho-syntactic. The vectorisation methods used are: Bag Of Words (BOW), TF-IDF, Word2Vec and BERT vectors. We perform a series of operations on the raw input text, such as removing numbers, punctuation and stop words, replacing negated contracted verbs with their complete forms (e.g., won't), splitting compound forms (e.g., state-of-the-art) and transforming text to lowercase. We also replace multiple white spaces with a single space. Examples of these transformations are provided in Table 7. For TF-IDF and Word2Vec, we experiment with and without stop word removal. We use the NLTK 5 stop words set and extend it with contraction patterns like 're or 'm. For the BERT and Word2Vec vectors, we refrain from applying these cleaning operations to maintain more semantic freedom. At the end of this phase, we generate a list of all the unique words in the training data, called the vocabulary.
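The cleaning operations described above can be sketched as follows; the contraction list here is illustrative, not the full set used in our experiments, and stop word removal is handled separately:

```python
import re

# Illustrative subset of the negative-contraction map; the real list is longer.
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "n't": " not"}

def clean(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():      # expand negative contractions
        text = text.replace(short, full)
    text = text.replace("-", " ")                 # split compound forms
    text = re.sub(r"\d+", " ", text)              # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)          # remove punctuation
    return re.sub(r"\s+", " ", text).strip()      # collapse white space

clean("He won't use state-of-the-art models in 2020!")
# -> "he will not use state of the art models in"
```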
For the morpho-syntactic phase, we apply POS-based stemming and lemmatisation for the BOW and TF-IDF embeddings. We also remove words with frequency less than 5 for these embeddings. This effectively decreases the dimensions of BOW and TF-IDF vectors.
Our final feature list consists of 2000 features that are the unique lemmatized vocabulary words and word groups curated from the input text. However, despite their simplicity and low time complexity, BOW and TF-IDF have two major drawbacks. First, as the size of the data and the number of unique words in the training text increase, the length of vectors becomes much larger. Moreover, in these two approaches, only words and their repetitions are important and the order of the words in the text is not considered in the model. This is why we also consider both Word2Vec and BERT embedding approaches in our experiments.
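A minimal TF-IDF sketch makes the first drawback concrete: every document vector has as many dimensions as there are unique words in the corpus, so vector length grows with the vocabulary.

```python
import math
from collections import Counter

def tfidf(docs):
    """Map each tokenised document to a TF-IDF vector over the full
    vocabulary (term frequency times a smoothed inverse document frequency)."""
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))     # document frequency
    idf = {w: math.log(len(docs) / df[w]) + 1 for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors

vocab, vectors = tfidf([["dogs", "have", "ears"],
                        ["dogs", "could", "not", "hear"]])
# 6 unique words -> every document vector has 6 dimensions
```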

Transformation: Example

No transformation: If the lawsuit can be another means of focusing attention on these fundamental issues, then optimistically the lawsuit can provide a larger benefit.
Cleaning: if the lawsuit can be another means of focusing attention on these fundamental issues then optimistically the lawsuit can provide larger benefit
Cleaning + Normalization: If the lawsuit can be anoth mean of focu attent on these fundament issu , then optimist the lawsuit can provid a larg benefit .
Cleaning (Word2Vec + No Stop Words): lawsuit another means focusing attention fundamental issues optimistically lawsuit provide larger benefit
Cleaning (BERT): if the lawsuit can be another means of focusing attention on these fundamental issues, then optimistically the lawsuit can provide a larger benefit.

D.0.1. Features.

We apply the same feature sets to our two systems. The bold mentions refer to their description in the results Table 3 and Table 4. For the BERT-MLP system, the features are introduced as additional input layers and calculated in the embedding layer. For the BiLSTM-CRF system, these features are declared in the CRF layer.
For POS tags, we use the Stanford 6 Part-Of-Speech Tagger to determine the role of each word in the discourse. For chunking the target sequences (CHUNKS), we tokenize each sequence and label each token of the chunk with its segment label (i.e., A for Antecedent and C for Consequent). To generate BERT vector features (BERTvec), we use the BERT-as-service 7 library (version 1.10.0) as a sentence encoder to map variable-length sentences to fixed-length feature embeddings. We also experiment with Syntactic Grammars (SG) by using the Stanford 8 Parser to generate syntactic dependency relations between words. Since our task requires labeling tokens, we can also use a feature that encodes sequence structure. As the results from (Reimers, 2017) show, the BIO tagging scheme performs consistently well for this type of task, which makes it the ideal candidate to add robustness to our feature set. Using the Stanford 9 Named Entity Recognizer, which is trained on BIO entity tagged data, we generate the BIO representation of sentence entities, or BIO Named Entity Recognition (BIO NER) tags.
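The BIO scheme itself can be sketched as follows; for illustration, entity spans are given as token ranges, a simplification of the Stanford NER output:

```python
def to_bio(tokens, entities):
    """Convert token-level entity spans to the BIO scheme: B- marks the
    first token of an entity, I- its continuation, O everything else.
    Entities are (start, end, type) token ranges."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = "Yseop entered the SemEval 2020 shared task".split()
tags = to_bio(tokens, [(0, 1, "ORG"), (3, 5, "MISC")])
# -> ['B-ORG', 'O', 'O', 'B-MISC', 'I-MISC', 'O', 'O']
```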
D.0.2. Embeddings.

Embeddings are computed from a convolution of the different D.0.1 features (plus additional embeddings for BiLSTM-CRF) stacked in the input layer. The bold mentions refer to their description in the results Table 3 and Table 4. For the BERT-MLP system, we add the cascaded layers of token features referenced in section D.0.1 to the existing embeddings architecture. Our final embeddings layer is a concatenation of the 3 BERT embeddings (positional, segment, token) and the generated feature layers. To feed the BiLSTM-CRF system, we evaluate several embedding methods. The first is Pre-trained Word Embeddings. This widely used technique helps tackle the problem of generalising to unseen words, since word embeddings are good at capturing general syntactic as well as semantic properties of words (Reimers, 2017). For our experiments, we focus on two different approaches: the GloVe embeddings trained on Common Crawl (about 840 billion tokens) and the FastText approach trained on Common Crawl (600 billion tokens), which also extracts subword information.
As per the approach recommended by (Ma and Hovy, 2016), we also add Character Embeddings (C2IDX) to take into account the character-level representation of words, as an additional layer on top of the word embedding layer. Finally, we also consider Stacked Embeddings (Stacked), i.e., a combination of existing embedding techniques designed to act in succession in a single pipeline to generate a refined embedding for our input. We use the Flair 10 library (version 0.4.5) to achieve this. Our embeddings pipeline is composed of: a GloVe model, a Flair-forward model, a Flair-backward model and a BERT embedding model. By combining these models, we cover different aspects of semantic representation for our input: the GloVe module targets word representation, the Flair modules provide character contextualisation, and the BERT layer provides sentence-level information extraction.
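The stacking idea reduces to concatenating each module's fixed-size vector per token. The embedders below are hashing stand-ins, not the actual GloVe/Flair/BERT modules; they only illustrate how the stacked dimension is the sum of the component dimensions.

```python
# Stand-in embedders (hash-based projections to small vectors), replacing
# the real GloVe/Flair/BERT modules purely to show the concatenation.
def make_embedder(dim, seed):
    def embed(token):
        h = hash((seed, token))
        return [((h >> (8 * i)) % 256) / 255.0 for i in range(dim)]
    return embed

glove_like = make_embedder(4, seed=1)   # word-level stand-in
flair_fwd = make_embedder(3, seed=2)    # forward character-context stand-in
flair_bwd = make_embedder(3, seed=3)    # backward character-context stand-in
bert_like = make_embedder(5, seed=4)    # sentence-level stand-in

def stacked(token):
    # a stacked embedding is the concatenation of each module's output
    return glove_like(token) + flair_fwd(token) + flair_bwd(token) + bert_like(token)

len(stacked("lawsuit"))  # 4 + 3 + 3 + 5 = 15 dimensions
```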

D.0.3. Regularisation.
Given the danger of over-parameterisation that neural networks present, we introduce some regularisation techniques.
For the BiLSTM-CRF model, we perform a K-Fold Cross Validation with K = 3. The data is not shuffled before splitting into batches. We also add Dropout. Results from (Reimers, 2017) show that variational dropout performs best for BiLSTM networks. Furthermore, (Cheng et al., 2017) show that relatively small dropout rates tend to yield better results for LSTM networks. For our experiments, we implement variational dropout on all layers with the fraction p of dropped values taken from the set {0.1, 0.3, 0.5}. The value p = 0.1 performs best after empirical testing and is retained for our final round of benchmark tests. We couple our system with an Elasticnet method (i.e., a linear regression model with combined L1 and L2 priors). The tested combinations of regularisation methods are the following (referenced as A, B and C in the results Table 3):
• Cross Validation (A)
• Cross Validation + Dropout (B)
• Cross Validation + Dropout + Elasticnet (C)
For the BERT-MLP system, we apply the default regularisation A.
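The key property of variational dropout, as opposed to standard dropout, is that one mask is sampled per sequence and reused at every timestep. A minimal sketch:

```python
import random

def variational_dropout(sequence, p, rng):
    """Sample ONE dropout mask over the feature dimensions and reuse it at
    every timestep (standard dropout would resample the mask per timestep);
    kept values are rescaled by 1/(1-p)."""
    dim = len(sequence[0])
    keep = 1.0 / (1.0 - p)
    mask = [0.0 if rng.random() < p else keep for _ in range(dim)]
    return [[x * m for x, m in zip(step, mask)] for step in sequence]

rng = random.Random(0)
seq = [[1.0] * 8 for _ in range(5)]       # 5 timesteps, 8 features
out = variational_dropout(seq, p=0.1, rng=rng)
# the same dimensions are dropped at every timestep
assert all(step == out[0] for step in out)
```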

D.0.4. Hyperparameters.
We also evaluate the effects of different hyperparameters on the performance of our models. For the BiLSTM-CRF system, the configurations tested in Table 8 are limited by the hardware we use: an i7 CPU, 16 GB of RAM and an 8 GB GPU.