I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Reviews

Counterfactual statements describe events that did not or cannot take place. We consider the problem of counterfactual detection (CFD) in product reviews. For this purpose, we annotate a multilingual CFD dataset from Amazon product reviews covering counterfactual statements written in English, German, and Japanese languages. The dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality professional annotations. We train CFD models using different text representation methods and classifiers. We find that these models are robust against the selectional biases introduced due to cue phrase-based sentence selection. Moreover, our CFD dataset is compatible with prior datasets and can be merged to learn accurate CFD models. Applying machine translation on English counterfactual examples to create multilingual data performs poorly, demonstrating the language-specificity of this problem, which has been ignored so far.


Introduction
Counterfactual statements are an essential tool of human thinking and are often found in natural languages. Counterfactual statements may be identified as statements of the form "If p was true, then q would be true" (i.e. assertions whose antecedent (p) and consequent (q) are known or assumed to be false) (Milmed, 1957). In other words, a counterfactual statement describes an event that may not, did not, or cannot take place, and the subsequent consequence(s) or alternative(s) did not take place. For example, consider the counterfactual statement "I would have been content with purchasing this iPhone, if it came with a warranty!". Counterfactual statements can be broken into two parts: a statement about the event (if it came with a warranty), also referred to as the antecedent, and the consequence of the event (I would have been content with purchasing this iPhone), referred to as the consequent. Counterfactual statements are ubiquitous in natural language and have been well-studied in fields such as philosophy (Lewis, 2013), psychology (Markman et al., 2007; Roese, 1997), linguistics (Ippolito, 2013), logic (Milmed, 1957; Quine, 1982), and causal inference (Höfler, 2005).

* The first two authors contributed equally.
Accurate detection of counterfactual statements is beneficial to numerous applications in natural language processing (NLP) such as in medicine (e.g., clinical letters), law (e.g., court proceedings), sentiment analysis, and information retrieval. For example, in information retrieval, counterfactual detection (CFD) can potentially help to remove irrelevant results to a given query. Revisiting our previous example, we should not return the iPhone in question for a user who is searching for iPhone with warranty because that iPhone does not come with a warranty. A simple bag-of-words retrieval model that does not detect counterfactuals would return the iPhone in question because all the tokens in the query (i.e. iPhone, with, warranty) occur in the review sentence. Detecting counterfactuals can also be a precursor to capturing causal inferences (Wood-Doughty et al., 2018) and interactions, which have shown to be effective in fields such as health sciences (Höfler, 2005). Janocko et al. (2016) and Son et al. (2017) studied CFD in social media for automatic psychological assessment of large populations.
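The retrieval failure described above can be sketched in a few lines. The snippet below is an illustrative toy (not from the paper): a naive bag-of-words matcher "finds" the review sentence because every query token occurs in it, even though the sentence is counterfactual and the product does not, in fact, come with a warranty.

```python
# Toy illustration: token-overlap retrieval matches a counterfactual review.
def bow_match(query: str, sentence: str) -> bool:
    """Return True if every query token occurs in the sentence (bag-of-words)."""
    q_tokens = set(query.lower().split())
    s_tokens = set(sentence.lower().replace(",", " ").replace("!", " ").split())
    return q_tokens <= s_tokens

review = "I would have been content with purchasing this iPhone, if it came with a warranty!"
query = "iphone with warranty"

print(bow_match(query, review))  # True: matched despite the counterfactual meaning
```

A CFD classifier run as a filter on top of such a retrieval model could suppress this false match.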
CFD is often modelled as a binary classification task (Son et al., 2017; Yang et al., 2020a). A manually annotated sentence-level counterfactual dataset was introduced in SemEval-2020 (Yang et al., 2020a) to facilitate further research into this important problem. However, successful development of classification methods requires extensive, high quality labelled datasets. To the best of our knowledge, there are currently only two labelled datasets for counterfactuals: (a) the pioneering small dataset of tweets (Son et al., 2017) and (b) a recent larger corpus covering the finance, politics, and healthcare domains (Yang et al., 2020a). However, these datasets are limited to the English language.
In this paper, we contribute to this emerging line of work by annotating a novel CFD dataset for a new domain (i.e. product reviews), covering languages in addition to English, such as Japanese and German, ensuring a balanced representation of counterfactuals and the high quality of the labelling. Following prior work, we model counterfactual statement detection as a binary classification problem, where given a sentence extracted from a product review, we predict whether it expresses a counterfactual or a non-counterfactual statement. Specifically, we annotate sentences selected from Amazon product reviews, where the annotators provided sentence-level annotations as to whether a sentence is counterfactual with respect to the product being discussed. We then represent sentences using different encoders and train CFD models using different classification algorithms.
The percentage of sentences that contain a counterfactual statement in a random sample of sentences has been reported to be as low as 1-2% (Son et al., 2017). Therefore, all prior works annotating CFD datasets have used clue phrases such as "I wished" to select candidate sentences that are likely to be true counterfactuals, which are then subsequently annotated by human annotators (Yang et al., 2020a). However, this selection process can potentially introduce a selection bias towards the clue phrases used.
To the best of our knowledge, while data selection bias is a recognised problem in other NLP tasks (e.g., Larson et al. (2020)), this selection bias in CFD classifiers has not been studied previously. Therefore, we train counterfactual classifiers with and without masking the clue phrases used for candidate sentence selection. Furthermore, we experiment with enriching the dataset with sentences that do not contain clue phrases but are semantically similar to the ones that contain clue phrases. Interestingly, our experimental results reveal that, compared to lexicalised CFD models that use bag-of-words representations, CFD models trained using contextualised masked language models such as BERT (Devlin et al., 2019) are robust against the selection bias. Our contributions in this paper are as follows:

First-ever Multilingual Counterfactual Dataset:
We introduce the first-ever multilingual CFD dataset containing manually labelled product review sentences covering English, German, and Japanese languages. 1 As already mentioned above, counterfactual statements are naturally infrequent. We ensure that the positive (i.e. counterfactual) class is represented by at least 10% of samples for each language. Distinguishing between counterfactual and non-counterfactual statements is a fairly complex task even for humans. Unlike previous works, which relied on crowdsourcing, we employ professional linguists to produce a high quality annotation. We follow the definition of counterfactuals used by Yang et al. (2020a) to ensure that our dataset is compatible with the SemEval-2020 CFD dataset (SemEval). We experimentally verify that by merging our dataset with the SemEval CFD dataset, we can further improve the accuracies of counterfactual classifiers. Moreover, applying machine translation on the English CFD dataset to produce multilingual CFD datasets results in poor CFD models, indicating the language-specificity of the problem, which requires careful manual annotation.
Accurate CFD Models: Using the annotated dataset we train multiple classifiers using (a) lexicalised, word-order insensitive bag-of-words representations as well as (b) contextualised sentence embeddings. We find that there is a clear advantage to using contextualised embeddings over non-contextualised embeddings, indicating that counterfactuals are indeed context-sensitive.

Related Work
Counterfactuals have been studied in various contexts such as problem solving (Markman et al., 2007), explainable machine learning (Byrne, 2019), advertisement placement (Joachims and Swaminathan, 2016) and algorithmic fairness (Kusner et al., 2017). Kaushik et al. (2020) proposed an annotation scheme whereby the original data is augmented in a counterfactual manner to overcome spurious associations that a classifier heavily relies upon, thus failing to perform well on test data distributions that are not identical. Unlike Kaushik et al. (2020) and the closely related work by Gardner et al. (2020), we are interested in identifying existing counterfactual statements and filtering these statements to improve search performance.
A CFD task was presented in the SemEval-2020 Challenge (Yang et al., 2020b). The provided dataset contains counterfactual statements from news articles. However, the dataset does not cover counterfactuals in e-commerce product reviews, which is our focus in this paper. One of the earliest CFD datasets was annotated by Son et al. (2017) and covers counterfactual statements extracted from social media. Both datasets are labelled for binary classification by crowdsourcing and contain only sentences in English. We will compare our dataset to these previous works in § 3.4. To summarise, our dataset is unique as it contains counterfactuals in multiple languages, covers a new application area of e-commerce reviews, and provides high quality annotations.
To achieve the best prediction quality, the participating systems employed ensemble strategies; the top performing systems use an ensemble of transformers (Ding et al., 2020; Fajcik et al., 2020; Lu et al., 2020). CFD datasets tend to be highly imbalanced because counterfactual statements are less frequent in natural language texts. Prior work has used techniques such as pseudo-labelling (Ding et al., 2020) and multi-sample dropout (Chen et al., 2020) to address the data imbalance and overfitting problems.

Dataset Curation
We adopt the definition of a counterfactual statement proposed by Janocko et al. (2016), where they define it as a statement which looks at how a hypothetical change in past experience could have affected the outcome of that experience. Their definition is based on the linguistic structures of 6 types of counterfactuals, as follows.
Conjunctive Normal: The antecedent is followed by the consequent. The antecedent consists of a conditional conjunction followed by a past tense subjunctive verb or past modal verb. The consequent contains a past or present tense modal verb. (Example: If everyone got along, it would be more enjoyable.)

Conjunctive Converse: The consequent is followed by the antecedent. The consequent consists of a modal verb and past or present tense verb. The antecedent consists of a conditional conjunction followed by a past tense subjunctive verb or past tense modal. (Example: I would be stronger, if I had lifted weights.)

Modal Normal: The antecedent is followed by the consequent. The antecedent consists of a modal verb and past participle verb. The consequent consists of a past/present tense modal verb. (Example: We should have gone bowling, that would have been better.)

Wish/Should Implied: The antecedent is present, the consequent is implied. The antecedent is the independent clause following 'wish' or 'should'. The consequent is implied and can be paraphrased as "would be better off". (Examples: I wish I had been richer. I should have revised my rehearsal lines.)

Verb Inversion: No specific order of the antecedent and consequent. The antecedent uses the subjunctive mood by inverting the verbs 'had' and 'were' to create a hypothetical conditional statement along with a past tense verb. The consequent consists of a modal verb and past or present tense verb.

These types follow the definitions of Janocko et al. (2016). We worked with professional linguists to extend these counterfactual definitions for the German and Japanese languages. While the extension of the definition from English to German is relatively straightforward, the extension to the syntactically and orthographically different structure of Japanese sentences was challenging (Jacobsen, 2011) and required re-writing the annotation guidelines, including additional examples. The annotation guidelines are included in the dataset release.

Data Collection
The main step of data collection in the previous works (Son et al., 2017; Yang et al., 2020a) is filtering of the data using a pre-compiled list of clue words/phrases. Because the exact list of clue phrases used by Janocko et al. (2016) was not publicly available, we created a new list of clue phrases following the definitions of counterfactual types. In addition, we compiled similar clue phrase lists for the German and Japanese languages. Yang et al. (2020a) applied a more complex procedure, where they match Part of Speech (PoS)-tagged sentences against lexico-syntactic patterns. In our work, we do not consider PoS-based patterns, which are difficult to generalise across languages. We use the Amazon Customer Reviews Dataset, 2 which contains over 130 million customer reviews collected and released by Amazon to the research community. To create an annotated dataset, we select reviews in different categories as detailed in the Supplementary. Next, we sample candidate sentences for annotation in two iterations.
In the first iteration, we consider reviews written by customers with a verified purchase (i.e., the customer has bought the product about which he or she is writing the review). Given that counterfactual statements are infrequent, all prior works (Son et al., 2017;Yang et al., 2020a) have used clue phrase lists for selecting data for human annotation. Following this practice, we select sentences that contain exactly one clue phrase from our precompiled clue phrase lists for each language. We remove sentences that are exceedingly long (more than 512 tokens) or short (less than 10 tokens).
2 https://s3.amazonaws.com/amazon-reviews-pds/readme.html

Shorter sentences might not contain sufficient information for a human annotator to decide whether a sentence is a counterfactual statement, whereas longer sentences are likely to contain various other information besides counterfactuals.
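The first-iteration filter can be sketched as follows. This is a minimal illustration, not the authors' code: the clue list shown is a tiny invented subset (the real per-language lists are part of the dataset release), and clue occurrence is checked by simple substring matching.

```python
# Sketch of first-iteration candidate selection: keep sentences that contain
# exactly one clue phrase and are between 10 and 512 tokens long.
CLUE_PHRASES = ["wish", "if only", "should have", "would have"]  # illustrative subset

def is_candidate(sentence: str) -> bool:
    tokens = sentence.lower().split()           # whitespace tokens as a stand-in
    if not (10 <= len(tokens) <= 512):          # length filter from the paper
        return False
    n_clues = sum(clue in sentence.lower() for clue in CLUE_PHRASES)
    return n_clues == 1                         # exactly one clue phrase

print(is_candidate("I wish this camera had come with a longer battery life overall"))
```

Sentences passing this filter are then sent to the professional annotators.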
The above-mentioned first iteration might produce a biased dataset in the sense that all sentences contain counterfactual clues from the predefined lists. There are two possible drawbacks in this selection method. First, the manually compiled clue phrase lists might not cover all the different ways in which a counterfactual can be expressed in a particular language. Therefore, the sentences selected using the clue phrase lists might have coverage issues. Second, a counterfactual classification model might assign high confidence scores to some high precision clue phrases (e.g., "wish" for English). Such a classifier is likely to perform poorly on test data that do not use clue phrases for expressing counterfactuality. On the other hand, adding sentences with no clue words to the dataset might result in a greater bias: those additional sentences are likely to be negative examples, and thus the discriminatory power of the clue phrases can get amplified. Later in our experiments, we empirically evaluate the effect of the selection bias due to the reliance on clue phrases.
To address the selection bias, in addition to the sentences selected in the first iteration, we conduct a second iteration where we select sentences that do not contain counterfactual clues from our lists. For this purpose, we create sentence embeddings for each sentence selected in the first iteration, using a pretrained multilingual BERT model. 3 We then use k-means clustering to cluster these sentences into k = 100 clusters. We assume each cluster represents some aspect of a product and is represented by its centroid. Next, we pick sentences that do not contain the clue phrases, compute their sentence embeddings, and measure the similarity to each of the centroids. For each centroid we select the top n most similar sentences for manual annotation. We set n such that we obtain an approximately equal number of sentences to the number of sentences that contain clue phrases selected in the first iteration. All selected sentences are manually annotated for counterfactuality as described in § 3.2.
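The second-iteration enrichment can be sketched as below. Hedged assumptions: the paper uses pretrained multilingual BERT sentence embeddings and k = 100; here we substitute TF-IDF vectors and k = 2 purely so the toy is self-contained and runnable, and the sentences are invented.

```python
# Sketch: cluster clue-phrase sentences, then pick no-clue sentences closest
# to each cluster centroid for annotation (TF-IDF stands in for mBERT here).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

clue_sents = [
    "i wish the battery lasted longer on this phone",
    "if only the screen was brighter it would be perfect",
    "i should have bought the larger size of this jacket",
    "it would have been great if the zipper was stronger",
]
no_clue_sents = [
    "the battery drains very quickly on this phone",
    "the screen is too dim outdoors",
    "the jacket runs small so order a size up",
    "the zipper broke after a week of use",
]

vec = TfidfVectorizer()
X_clue = vec.fit_transform(clue_sents)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_clue)  # k = 100 in the paper

X_no = vec.transform(no_clue_sents)
sims = cosine_similarity(X_no, km.cluster_centers_)   # (n_sentences, n_clusters)
top_n = 1                                             # chosen per cluster in the paper
for c in range(km.n_clusters):
    picked = np.argsort(-sims[:, c])[:top_n]
    print(c, [no_clue_sents[i] for i in picked])
```

With real mBERT embeddings, `X_clue`/`X_no` would instead be dense sentence vectors from the encoder.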

Annotation
The annotators were provided guidelines with definitions and extensive examples and counterexamples. Briefly, a statement was identified as counterfactual if it belongs to any of the counterfactual types described in § 3. If any part of a sentence contains a counterfactual, then we consider the entire sentence to be counterfactual. This annotation process increases the number of counterfactual examples and the coverage across counterfactual types in the dataset, thereby improving the class imbalance. We require that at least 90% of the sentences have the agreement of 2 professional linguists (2 out of 2 agreement); in the remaining at most 10% of cases, a third linguist resolved the disagreement (2 out of 3 agreement).
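The adjudication rule above can be stated as a small function. This is an assumed helper, not from the paper, with binary labels (1 = counterfactual, 0 = not).

```python
# Sketch of the agreement rule: accept a label on 2-out-of-2 agreement,
# otherwise require a third annotation and take the 2-out-of-3 majority.
def adjudicate(label_a, label_b, label_c=None):
    if label_a == label_b:
        return label_a                         # 2-out-of-2 agreement
    if label_c is None:
        raise ValueError("disagreement: a third annotation is required")
    votes = [label_a, label_b, label_c]
    return max(set(votes), key=votes.count)    # 2-out-of-3 majority

print(adjudicate(1, 1))     # 1
print(adjudicate(1, 0, 0))  # 0
```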

Dataset Statistics
The basic dataset statistics can be found in Table 1. We present two versions of the English dataset: EN contains only sentences filtered by the clue words, and EN-ext is a superset of EN enriched by sentences with no clue words, as described above. The clue-based dataset EN contains about one fifth positive examples, while its extended version contains one tenth counterfactuals. Only 76 out of the 4,977 added sentences were labelled positively. The DE dataset contains 69.1% counterfactuals and the JP dataset 9.5%.
The summary of clue phrase distributions in the positive and negative classes is shown in Table 2. Interestingly, the English and German lists have approximately the same number of clues, but the precision of the German clues is much higher, resulting in more counterfactual statements being extracted using those clue phrases. In contrast, the Japanese list has the largest number of clues, yet results in the lowest precision. The specification of counterfactual clue phrases for Japanese is a linguistically hard problem because the meaning of the clues is highly context dependent. The large number of Japanese clue phrases is due to the orthographic variations present in Japanese, where the same phrase can be written using kanji, hiragana, katakana characters, or a mixture of them. Because we were able to select sufficiently large datasets for German and Japanese using the clue phrases, we did not apply the second iteration step described in § 3.1 to those languages.

Comparison with Existing Datasets
We compare the multilingual counterfactual dataset we create against existing datasets in Table 3. Our dataset is well-aligned with the two other existing datasets in the sense that we use the same definition of a counterfactual, keep a similar percentage of positive examples, and use similar keywords for dataset construction. These properties ensure that our dataset of product reviews can be used on its own, as well as organically combined with the existing datasets from other domains. A distinctive feature of our dataset is its coverage of a novel domain, e-commerce reviews, which is not covered by any of the existing counterfactual datasets. Furthermore, our dataset is available for three languages: English, German, and Japanese. This is the first counterfactual dataset not limited to the English language. Unlike previous works, which relied on crowdsourcing, we employed professional linguists to produce the lists of clue words and to supervise the annotation. This ensures the high quality of the labelling.

Evaluations
We conduct a series of experiments to systematically evaluate several important factors related to counterfactuality, such as (a) selection bias due to clue phrases (§ 4.1), (b) the effect of merging multiple counterfactual datasets (§ 4.2), (c) cross-lingual transfer via machine translation (§ 4.3), and (d) different sentence encoders and classifiers (§ 4.4).

For the evaluations in (a), (b), and (c), we fine-tune a widely used multilingual transformer model, multilingual BERT (mBERT) (Devlin et al., 2019), to train a CFD model. The model is pretrained for the tasks of masked language modelling and next sentence prediction for 104 languages 4 and is used with the default parameter settings. The model is implemented using the Transformers library. 5 We fine-tune a linear layer on top of these pretrained language models for the CFD task using the training process described next. 6

We use an 80%-20% train-test data split and tune hyperparameters via 5-fold cross-validation. Hyperparameters in the already pretrained transformer models are kept fixed. F1, Matthews Correlation Coefficient (MCC; Boughorbel et al., 2017), and accuracy are used as evaluation metrics. MCC (∈ [−1, 1]) accounts for class imbalance and incorporates all correlations within the confusion matrix (Chicco and Jurman, 2020). Accuracy may be misleading in highly imbalanced datasets because simply classifying all instances into the majority class yields a high accuracy. However, for consistency with prior work, we report all three evaluation metrics in this paper. All the reported results are averaged over at least 3 independently trained models initialised with the same hyperparameter values. For tokenisation, unless the tokeniser is prespecified for the model, we use word_tokenize from nltk.tokenize.punkt 7 for the English and German languages, and MeCab 8 as the morphological analyser for Japanese.

4 https://huggingface.co/bert-base-multilingual-uncased
5 https://github.com/huggingface/transformers
6 See Supplementary for the details on fine-tuning.

Selection Bias due to Clue Phrases
To evaluate the effectiveness of clue phrases for selecting sentences for human annotation and any selection bias due to this process, we fine-tune mBERT with and without masking the clue phrases.
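The masking ablation can be sketched as a preprocessing step. A hedged assumption here: the supplementary states that clue words are replaced with a [PAD] token before fine-tuning, so we use that token; the clue list and sentence are invented for illustration.

```python
# Sketch: replace clue phrases with the model's padding token so the
# classifier cannot rely on their surface forms.
import re

def mask_clues(sentence, clues, mask_token="[PAD]"):
    out = sentence
    for clue in sorted(clues, key=len, reverse=True):   # longest phrases first
        out = re.sub(re.escape(clue), mask_token, out, flags=re.IGNORECASE)
    return out

print(mask_clues("I wish it would have arrived sooner", ["wish", "would have"]))
# → "I [PAD] it [PAD] arrived sooner"
```

Training on the masked sentences forces the model to use the remaining context rather than memorising the clue phrases.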
Classification performance values are shown in Table 4. Overall, we see that no mask (training without masking) returns slightly better performance than mask (training with masking); however, the differences are not statistically significant. This is reassuring because it shows that the sentence embeddings produced by mBERT generalise well beyond the clue phrases used to select sentences for manual annotation. On the other hand, if a CFD model had simply memorised the clue phrases and was classifying based on the occurrences of the clue phrases in a sentence, we would expect a drop in classification performance in the no mask setting due to overfitting to clue phrases that are not observed in the test data. Indeed, for EN, where all sentences contain clue phrases, we see a slight drop in all evaluation measures for no mask relative to mask, which we believe is due to this overfitting effect. The performance on JP is the lowest among all languages compared. This could be attributed to tokenisation issues and the limited Japanese coverage in mBERT. Many counterfactual clues in Japanese are parts of verb/adjective inflections, which can get split or removed during tokenisation.

We further compare precision and recall under the masked (subscript m) and non-masked (subscript nm) settings. In all datasets the recall is higher than the precision for both masked and non-masked versions, due to the dataset imbalance with an under-represented positive class. The number of positive examples misclassified under the masked and non-masked settings is typically very small. We see that the CFD model trained on EN-ext has a higher recall but lower precision than the one trained on EN. Most of the added examples in EN-ext are negatives, which makes it hard to maintain a high precision.

Cross-Dataset Adaptation
To study the compatibility of our dataset with existing datasets, we train CFD models on EN, on SemEval, and on their combination (Comb). The product reviews we use cover a narrow subdomain compared to the domains in SemEval. Interestingly, the CFD model trained on Comb reports the best performance across all measures, indicating that our dataset is compatible with SemEval and can be used in conjunction with existing datasets to train better CFD models.

Cross-Lingual Transfer via MT
Considering the costs involved in manually annotating counterfactual statements for each language, a frugal alternative would be to train a model for English and then apply it to test sentences in a target language of interest, translated into English using a machine translation (MT) system. To evaluate this possibility, we first translate the German and Japanese CFD datasets into English (denoted respectively by DE-EN and JP-EN) using Amazon MT. 9 Next, we train separate English CFD models using the EN, EN-ext and SemEval datasets, and apply those models on DE-EN and JP-EN. As shown in Table 7, the MCC values for the MT-based CFD models are significantly lower than those for the corresponding in-language baselines, which are trained using the target language data. Therefore, simply applying MT on test data is not an alternative to annotating counterfactual datasets from scratch for a novel target language. This result shows the importance of developing counterfactual datasets for languages other than English, which had not been done prior to this work.

Sentence Encoders and Classifiers
We evaluate the effect of the sentence encoding and binary classification methods on the performance of CFD using multiple settings.

Bag-of-N-grams (BoN):
We represent a sentence using tf-idf weighted uni-grams and bi-grams, and ignore n-grams with a frequency less than 2 or above the 95% point of the frequency distribution.

Pretrained Language Models: Along with mBERT, we fine-tune a linear layer for the CFD task on top of the following two pretrained transformer models: the XLM model (Conneau and Lample, 2019) 11 and the base XLM-RoBERTa model (Conneau et al., 2020). 12 Both models were trained for the task of masked language modelling on 100 languages.
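The BoN representation can be sketched with scikit-learn. A hedged assumption: we interpret the frequency cut-offs as document-frequency thresholds (`min_df=2`, `max_df=0.95`), which is the standard scikit-learn reading; the sentences are invented toy data.

```python
# Sketch of the BoN features: tf-idf weighted uni- and bi-grams with
# document-frequency cut-offs, as one plausible reading of the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "i wish this phone had a better battery",
    "i wish the screen were larger",
    "great phone with a great battery",
    "the battery lasts all day",
]

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.95)
X = vec.fit_transform(sentences)
print(X.shape)                      # (4, number of surviving n-grams)
print(sorted(vec.vocabulary_))      # n-grams that pass both cut-offs
```

The resulting sparse matrix `X` can then be fed to any of the traditional ML classifiers (e.g., an SVM).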

Results
Here we extend our experiment with clue word masking. For the transformer-based models we mask the clue words as described for mBERT. For the traditional ML methods, we remove the clue words from the sentences before tokenisation.
The results with and without masking are reported in Table 8 (F1 and Accuracy are reported in the Supplementary). First, we note that masking decreases the performance of all classifiers on all datasets. Transformer-based classifiers are the least affected by masking: they are able to learn semantic dependencies from the remaining text. We could also say that transformers are the least affected by the data-selection bias as they do not rely on the clue words. Traditional ML methods with BoN features are affected by masking the most: they seem to use clue words for discrimination. Interestingly, for these methods the performance drops equally for clue-based EN and enriched EN-ext datasets. This could indicate that in both cases the classifier relies on the clue words.
Overall, transformer-based models (especially XLM-RoBERTa) perform the best across all datasets except for JP. For JP the best performance is obtained by an SVM model with BoN features. This could indicate that for Japanese, a language-specific tokenisation works better for the lexicalised (BoN) models than the language-independent subtokenisation methods such as Byte Pair Encoding (BPE; Sennrich et al., 2016) that are used when training contextualised transformer-based sentence encoders. The former preserves more information than the latter at the expense of a sparser and larger feature space (Bollegala et al., 2020). Transformer-based masked language models, on the other hand, require subtokenisation as they must use a smaller vocabulary to make the token prediction task efficient (Yang et al., 2018; Li et al., 2019). In general, unlike the simpler word embedding and bag-of-words approaches, large pretrained contextualised embeddings maintain high test performance according to the reported evaluation metrics. We note that these also converged after a few epochs using a relatively small number of labelled instances, based on the model with the best 5-fold validation accuracy. Hence, contextualised embeddings can identify various context-dependent counterfactuals from a diverse range of reviews using a small number of mini-batch gradient updates of a single linear layer. Among the different sentence embedding methods compared, the best performance is reported by XLM-RoBERTa.
Between the two baselines, we see that using word embeddings to represent the sentences does not offer clear benefits for traditional ML methods and BoN features are sufficient. However, embedding based methods suffer generally a smaller performance drop when clues are masked. This suggests that embeddings provide a more general and robust representation of counterfactuals in the semantic space than BoN features.

Conclusion
We annotated a multilingual counterfactual dataset using Amazon product reviews for the English, German and Japanese languages. Experimental results show that our English dataset is compatible with the previously proposed SemEval-2020 Task 5 dataset. Moreover, the CFD models trained using our dataset are relatively robust against the selection bias due to clue phrases. Simply applying MT on test data results in poor cross-lingual classification performance, indicating the need for language-specific CFD datasets.

Ethical Considerations
In this work, we annotated a multilingual dataset covering counterfactual statements. Moreover, we train CFD models using different sentence representation methods and binary classification algorithms. In this section, we discuss the ethical considerations related to these contributions.
With regard to the dataset being released, all sentences that are included in the dataset were selected from a publicly available Amazon product review dataset. In particular, we do not collect or release any additional product reviews as part of this paper. Moreover, we have manually verified that the sentences in our dataset do not contain any customer sensitive information. However, product reviews do often contain subjective opinions, which can sometimes be socially biased. We do not filter out any such biases.
We use two pretrained sentence encoders, mBERT and XLM-RoBERTa, when training the CFD models. It has been reported that pretrained masked language models encode unfair social biases such as gender, racial and religious biases (Bommasani et al., 2020). Although we have not ourselves evaluated the social biases in the mBERT and XLM-RoBERTa based CFD models that we use in our experiments, we suspect that any social biases encoded in these pretrained masked language models could propagate into the CFD models that we train. In particular, these social biases could be further amplified during the CFD model training process if the counterfactual statements in the training data also contain such biases. Debiasing masked language models is an active research field (Kaneko and Bollegala, 2021) and we plan to evaluate the social biases in CFD models in our future work.

Supplementary Materials
A Fine-tuned multilingual BERT for counterfactual classification

Given that we select mBERT (Devlin et al., 2019) as the main classification method in the paper, we describe how the original BERT architecture is adapted and fine-tuned for counterfactual classification. Consider a dataset D = {(X_i, y_i)}_{i=1}^{m} and a sample s := (X, y), where the sentence X := (x_1, ..., x_n), with n being the number of words x ∈ X. We can represent a word as an input embedding x_w ∈ R^d, which has a corresponding target label y. In the pretrained transformer models we use, X_i is represented by 3 types of embeddings: word embeddings (X_w ∈ R^{n×d}), segment embeddings (X_s ∈ R^{n×d}) and position embeddings (X_p ∈ R^{n×d}), where d is the dimensionality of each embedding matrix. The self-attention block in a transformer mainly consists of three sets of parameters: the query parameters Q ∈ R^{d×l}, the key parameters K ∈ R^{d×l} and the value parameters V ∈ R^{d×o}, with 12 attention heads (as in BERT-base). The last hidden representations of both directions are concatenated, Z := [←Z; →Z], and projected using a final linear layer W ∈ R^d followed by a sigmoid function σ(·) to produce a probability estimate ŷ. As in the original BERT paper, WordPiece embeddings (Wu et al., 2016) are used with a vocabulary size of 30,000. The clue words used for filtering the sentences are masked using a [PAD] token to ensure the model does not simply learn to correctly classify some samples based on the association of these tokens with counterfactuals. A linear layer is then fine-tuned on top of the hidden state h_{X,[CLS]} emitted for the [CLS] token, and is used to predict whether the sentence is counterfactual or not, where B ⊂ D is a mini-batch and L_ce is the cross-entropy loss.
Configurations For the mBERT counterfactual model we use BERT-base, which has 12 Transformer blocks and 12 self-attention heads with a hidden size of 768. The default maximum sentence length of 512 is used, and the sentence representation is taken as the final hidden state of the first [CLS] token. This model is already pretrained, and we fine-tune a linear layer W on top of BERT whose output is fed through a sigmoid function σ as p(c|h) = σ(Wh), where c is the binary class label, and we maximize the log-probability of correctly predicting the ground-truth label.

B Matthews Correlation Coefficient
Unlike metrics such as F1, MCC accounts for class imbalance and incorporates all correlations within the confusion matrix (Chicco and Jurman, 2020). MCC ranges over [-1, 1], where 1 represents a perfect prediction, 0 a prediction no better than chance, and -1 a completely inverted prediction.
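For concreteness, MCC can be computed directly from the binary confusion-matrix counts; a small sketch:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventional value when any marginal count is zero.
    return numerator / denominator if denominator else 0.0
```

Note how class imbalance is handled: a classifier that always predicts the majority class on a 90/10 split, e.g. mcc(0, 0, 10, 90), scores 0 despite 90% accuracy.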

E Hardware Used
All transformer, RNN, and CNN models were trained on an NVIDIA GeForce GTX 1070 GPU with 8GB of GDDR5 memory.

F Model Configuration and Hyperparameter Settings
BERT-base uses 12 Transformer blocks and 12 self-attention heads with a hidden size of 768. The default maximum sentence length of 512 is used, and the sentence representation is taken as the final hidden state of the first [CLS] token. A fine-tuned linear layer W is used on top of BERT-base, whose output is fed through a sigmoid function σ as p(c|h) = σ(Wh), where c is the binary class label, and we maximize the log-probability of correctly predicting the ground-truth label. Table 11 shows the pretrained model configurations that were predefined before our experiments. The number of (Num.) hidden groups is the number of groups for the hidden layers, where parameters in the same group are shared. The intermediate size is the dimensionality of the feed-forward layers of the Transformer encoder. The 'Max Position Embeddings' is the maximum sequence length the model can handle.
We now detail the hyperparameter settings for transformer models and the baselines. We note that all hyperparameter settings were performed using a manual search over development data.

F.1 Transformer Model Hyperparameters
We did not change the original hyperparameter settings that were used for pretraining each transformer model. The hyperparameter settings for these pretrained models can be found in the class-argument documentation of each model's configuration Python file under https://github.com/huggingface/transformers/blob/master/src/transformers/ (e.g. the configuration .py file for each model), and are also summarized in Table 11.
For fine-tuning transformer models, we manually tested different combinations of a subset of hyperparameters, including the learning rates {5×10^-4, 10^-5, 5×10^-5}, batch sizes {16, 32, 128} and the warmup proportion {0, 0.1} used with the adaptive momentum (Adam) optimizer. Please refer to the huggingface documentation at https://github.com/huggingface/transformers for further details on each specific model, e.g. at https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py, and for details of the architecture of the BertForSequenceClassification pytorch class that is used for our sentence classification, and likewise for the remaining models.
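The manual search above enumerates at most 3 × 3 × 2 = 18 fine-tuning configurations. A sketch of the enumeration follows; the learning rates are taken as 5×10^-4, 10^-5 and 5×10^-5 (our reading of the values reported in this section), and dev_score is a hypothetical stand-in for fine-tuning one model and scoring it on development data.

```python
from itertools import product

# Hyperparameter grids from this section (learning rates as interpreted above).
learning_rates = [5e-4, 1e-5, 5e-5]
batch_sizes = [16, 32, 128]
warmup_proportions = [0.0, 0.1]

# Enumerate every combination of the three grids.
configs = [
    {"lr": lr, "batch_size": bs, "warmup": w}
    for lr, bs, w in product(learning_rates, batch_sizes, warmup_proportions)
]

def best_config(configs, dev_score):
    """Return the configuration maximizing the development-set score."""
    return max(configs, key=dev_score)
```

Each configuration corresponds to one fine-tuning run, so the manual search trains at most 18 models per encoder.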
Fine-tuning each language model with a sentence classifier took less than two and a half hours. For example, for the largest transformer model we used, BERT, the estimated average runtime for a full epoch with batch size 16 (over 2,682 training samples) is 184.13 seconds. In the worst case, if the model does not converge early and all 50 training epochs are carried out, training lasts about two and a half hours.

F.2 Baseline Hyperparameters
SVM Classifier: A radial basis function (RBF) kernel was used as the nonlinear kernel, tested with ℓ2 regularization settings of C = {0.01, 0.1, 1}, while the kernel coefficient γ is auto-tuned by the scikit-learn Python package and class weights are set inversely proportional to the number of samples in each class. To calibrate probability estimates for AUC scores, we use Platt scaling (Platt et al., 1999).
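This configuration can be sketched in scikit-learn as follows; the synthetic, imbalanced dataset is an illustrative stand-in for the review features used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the actual review features.
X, y = make_classification(n_samples=200, n_features=20, weights=[0.8, 0.2],
                           random_state=0)

# RBF kernel; gamma="scale" is scikit-learn's automatic kernel coefficient,
# class_weight="balanced" weights classes inversely to their frequency, and
# probability=True enables Platt-scaled probability estimates for AUC.
svm = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced",
          probability=True, random_state=0)
svm.fit(X, y)
probs = svm.predict_proba(X)  # calibrated class probabilities
```

In practice C would be selected from {0.01, 0.1, 1} on development data rather than fixed at 1.0 as here.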

Decision Tree and Random Forest Classifiers:
We use 20 decision tree classifiers with no restriction on tree depth, and the minimum number of samples required to split an internal node is set to 2. The criterion for splitting nodes is Gini impurity (Gini, 1912).
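The random forest settings above map directly onto scikit-learn's classifier; a minimal sketch (apart from n_estimators=20, these are scikit-learn's defaults, and the fitted data is synthetic and illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 20 trees, unrestricted depth, minimum split of 2 samples, Gini criterion.
forest = RandomForestClassifier(n_estimators=20, max_depth=None,
                                min_samples_split=2, criterion="gini",
                                random_state=0)

# Illustrative synthetic data standing in for the review features.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
forest.fit(X, y)
preds = forest.predict(X)
```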

G Further Details on the Datasets
Misclassifications of Reviews Containing No Counterfactuals (each review is followed by the models that misclassified it):

If you workout regularly, an extra set of 'expendable' earbuds like these is a must-have. (Models: B XR X)
I put over 500 songs on it the first day and still have around 17 GB left, probably could have done with a much smaller one. (Models: B XR X)
If you have a similar build compared to mine, buy this shirt without hesitation. (Models: B XR X)
If they ever need replacing I would definitely buy these again. (Models: B XR X)
If this device for whatever reason fails within a year or two, I think I would look to buy the same machine again. (Models: B XR X)
I was hoping she would be able to grow in it but it fits her now with no room to grow. (Models: B XR X)
I must have read reviews on about 20 different models. (Models: B XR X)
Because it is fleece, if you are in the US, I would suggest a second cool water rinse with a touch of fabric softener. (Models: B XR X)
There are ways to get it like you want it but its not as easy as it could have been. (Models: B XR X)
Could be improve with a size adjustment and chin strap. (Models: B XR X)
If you need more desk space and have a location where you can use a wall mount for your monitors, this thing is the way to go. (Models: B XR X)
It should be about $20 cheaper to make it worth while. (Models: B XR X)

Misclassifications of Counterfactual Reviews (each review is followed by the models that misclassified it):

At the end of a series like The Wheel of Time, it might be appropriate to lament the loss of familiar characters. (Models: B XR X)
You would have to be 5'10 and super thin to fit into these. (Models: B XR X)
From the picture the dress looks like it should be long enough for someone at lease 5' 6. (Models: B XR X)
To say "the usual awesome Stephen King novel" would be an understatement. (Models: B XR X)
I don't like to go into the plot a lot unless the blurb doesn't represent the book fairly. (Models: B XR X)
I've thought about it, and I guess that's because what happened to the characters in Missing are stuff that I could imagine happening to me as well. (Models: B XR)
For the price that this particular seller charged for this T-shirt, the material SHOULD be HEAVY-DUTY. (Models: B X)
If one can put aside their religious beliefs about heaven and hell I think they will find this to be something they've always known deep inside about the afterlife. (Models: XR X)
If you think leakage is a problem it really isn't they are as bad as a pair of ear-buds. (Models: XR X)