IDIAPers @ Causal News Corpus 2022: Efficient Causal Relation Identification Through a Prompt-based Few-shot Approach

In this paper, we describe our participation in subtask 1 of CASE-2022, Event Causality Identification with Causal News Corpus. We address the Causal Relation Identification (CRI) task by exploiting a set of simple yet complementary techniques for fine-tuning language models (LMs) on a few annotated examples (i.e., a few-shot configuration). We follow a prompt-based prediction approach for fine-tuning LMs in which the CRI task is treated as a masked language modeling (MLM) problem. This approach allows LMs natively pre-trained on MLM tasks to directly generate textual responses to CRI-specific prompts. We compare the performance of this method against ensemble techniques trained on the entire dataset. Our best-performing submission was fine-tuned with only 256 instances per class, 15.7% of all the available data, and yet obtained the second-best precision (0.82), third-best accuracy (0.82), and an F1-score (0.85) very close to that reported by the winning team (0.86).

Causal relation identification aims to predict whether or not there exists a cause-effect relation between a pair of events mentioned in a given text. For example, in the sentence "Protests spread to 15 towns and resulted in the destruction of property", an automatic causal identification system must be able to recognize that there is a cause-effect relation between the events "protest" and "destruction".
Hence, understanding causal relations within a text is an essential aspect of natural language processing (NLP) and understanding (NLU) (Ayyanar et al., 2019a; Li et al., 2021; Tan et al., 2022c). Once causal information is identified within a text, such knowledge becomes beneficial for many other downstream NLP tasks, e.g., Information Extraction, Question Answering, and Text Summarization (Ayyanar et al., 2019a; Man et al., 2022). However, due to the ambiguity and diversity of written documents, causality identification is not easy and remains a challenging problem. Our code is available at https://github.com/idiap/cncsharedtask.
The Event Causality Identification with Causal News Corpus (CASE-2022) shared task (Tan et al., 2022a) addresses this problem on a recently created corpus named the Causal News Corpus (CNC) (Tan et al., 2022b). Unlike previously existing causality corpora, the CNC dataset, manually annotated by experts, incorporates a broader set of causal linguistic constructions, i.e., it is not limited to explicit constructions, resulting in a more challenging dataset.
In this paper, we describe the methodology we followed to address the causal event classification shared task (subtask 1) of the CASE-2022 competition (Tan et al., 2022a). Our primary method, based on a few-shot configuration, follows a prompt-based approach for fine-tuning the language model (LM). The intuitive idea of this approach is to let the LM directly auto-complete natural language prompts. Following this technique, we leverage the LM's knowledge and let it decide the correct label for the input sequence. Additionally, we evaluate the performance of ensemble techniques trained using the entire available dataset. Our results demonstrate that our few-shot, prompt-based fine-tuning approach can generalize well even when using as few as 256 samples per class for training, outperforming ensemble techniques trained with the entire dataset, as well as most other teams' submissions.
The rest of the paper is organized as follows.
Section 2 describes relevant related work; Section 3 describes the components of our main method, namely the prompt-based approach; Section 4 describes the experimental setup, i.e., datasets, additional baselines, experiment configuration, and the obtained results. Finally, Section 5 presents our main conclusions and future work directions.

Related Work
Previous work on causal relation identification varies from knowledge-based to deep neural network (Deep-NN) approaches. Knowledge-based systems rely on linguistic patterns extracted through an exhaustive exploration of the data, where lexico-semantic and syntactic analysis leads to the identification of relevant structures and keywords that signal the presence of a causal relation in the text (Garcia, 1997; Khoo et al., 2000). Although interpretable, these methods require substantial human effort to generate relevant patterns and result in models that are not readily applicable to different domains. Statistical machine learning (ML) approaches leave it to the selected algorithm to find patterns in the data on the basis of manual annotations. Traditionally, using different NLP tools, it is possible to compute various features for a given collection and apply any ML pipeline to train a causal relation classifier, e.g., (Rutherford and Xue, 2014; Hidey and McKeown, 2016). However, one main disadvantage of these techniques is their language dependency and the error propagation of the NLP tools, e.g., syntactic parsers.
Finally, recent approaches based on Deep-NNs have become popular given their powerful representation-learning ability. Typical approaches include convolutional neural networks (Ayyanar et al., 2019b), long short-term memory networks (Li et al., 2021), and pre-trained transformer-based LMs such as BERT (Devlin et al., 2019), where a standard fine-tuning approach makes the detection of causal relations possible (Tan et al., 2022c; Khetan et al., 2022; Fajcik et al., 2020). These methods normally involve high computational costs and large amounts of labeled data. However, in this work we show that pre-trained LMs can still be effective even when fine-tuned with very few instances.
Contrary to previous work, we evaluate the effectiveness of very recent prompt-based prediction approaches under a few-shot configuration for causal relation identification.

Prompt-Based Approach
In the "pre-train, prompt, and predict" paradigm, unlike the standard "pre-train and fine-tune" paradigm, instead of adapting pre-trained LMs to downstream tasks via objective engineering, downstream tasks are reformulated to look more like those solved during the LM pre-training phase (Liu et al., 2021). More precisely, prompt-based prediction treats the downstream task as a masked language modeling problem, where the model directly generates a textual response (referred to as a label word) to a given prompt defined by a task-specific template (Gao et al., 2020). For instance, when identifying the sentiment of a movie review like "I love this movie.", we may continue with "Overall, it was a [MASK] movie." and ask the LM to fill the mask with a sentiment-bearing word. In this example, the original input text x ("I love this movie.") is modified using the template "[x] Overall, it was a [MASK] movie." into a textual string prompt x′ in which the mask will be filled with a label word. Possible label words for this example could be "fantastic" or "boring".
In the case of classification tasks, in addition to defining a set of possible label words, it is necessary to define a mapping between each label word and the actual output labels. For instance, if the labels + and − refer to positive and negative sentiment, respectively, "fantastic" in the previous example could be mapped to the output label +, and "boring" to −.
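The template-plus-mapping machinery above can be sketched in a few lines of plain Python. This is an illustrative sketch following the movie-review example, not the paper's actual code; all names are our own:

```python
# Sketch of prompt construction and label-word ("verbalizer") mapping.
# Template and label words follow the movie-review example above.

TEMPLATE = "{x} Overall, it was a [MASK] movie."

# word(y): maps each output label to its label word.
WORD = {"+": "fantastic", "-": "boring"}
# Inverse mapping: from a predicted label word back to the output label.
LABEL_OF = {w: y for y, w in WORD.items()}

def f_t(x: str) -> str:
    """Instantiate template t on input x, producing the prompt x'."""
    return TEMPLATE.format(x=x)

prompt = f_t("I love this movie.")
# The LM is then asked to fill [MASK]; if it prefers "fantastic",
# the prediction maps back to the output label "+".
assert prompt == "I love this movie. Overall, it was a [MASK] movie."
assert LABEL_OF["fantastic"] == "+"
```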
Formally, let L be a pre-trained language model and f_t(x) a function that converts the input x into a prompt by instantiating a template t that contains one [MASK] token. Let word : Y → W be a mapping from the task label space Y to the set of label words W. Then, the classification task is converted to a masked language modeling (MLM) task in which the probability of predicting class y ∈ Y is modeled as:

p(y | x) = p([MASK] = word(y) | f_t(x)) = exp(w_word(y) · h_[MASK]) / Σ_{y′∈Y} exp(w_word(y′) · h_[MASK])    (1)

where h_[MASK] is the hidden vector of the [MASK] token and w_v is the pre-trained MLM-head weight vector of word v.

Figure 1: Augmented prompt-based classification for the causality identification task. First, the input instance x = "Soldiers were hurt in the attacks" is converted into three different input prompts by applying f′_t(x) three times. Then, these three prompts are given to a RoBERTa model, and one logit vector is obtained for each. These vectors are averaged, and the word with the highest score, "causal", is selected. Finally, this word is mapped to its corresponding class, and x is classified as positive. Note that, in this example, we have the word-to-class label mapping word(positive) = "causal" and word(negative) = "random".
When fine-tuning L to minimize the cross-entropy loss, the pre-trained weights w_v are re-used and no new parameters need to be introduced. On the contrary, with standard fine-tuning, a task-specific head, softmax(W_o h_[CLS]), has to be added, with new task-specific learnable parameters W_o ∈ R^{|Y|×d}, which increases the gap between pre-training and fine-tuning.
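The class probability of Equation 1 is simply the MLM softmax restricted to the label words. A minimal sketch, with illustrative numbers standing in for the dot products w_v · h_[MASK] (the function name and values are ours, not the paper's):

```python
import math

def class_probs(mask_logits: dict, word: dict) -> dict:
    """Softmax over label words only, as in Equation 1.

    mask_logits: score of each vocabulary word at the [MASK] position,
                 i.e. a stand-in for w_v . h_[MASK].
    word: mapping from task label to label word.
    """
    scores = {y: mask_logits[w] for y, w in word.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# Illustrative logits (not real model outputs): words outside the
# label-word set (e.g. "movie") are simply ignored by the softmax.
logits = {"causal": 2.0, "random": 0.5, "movie": 1.0}
probs = class_probs(logits, {"+": "causal", "-": "random"})
assert abs(sum(probs.values()) - 1.0) < 1e-9
assert probs["+"] > probs["-"]
```

Because the scoring re-uses the pre-trained MLM head, no new parameters appear anywhere in this computation, which is exactly the point made above.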
Hereafter, we will refer to the "causal" and "non-causal" classes as "positive" (+) and "negative" (−), respectively. In addition, following previous work by Gao et al. (2020), we append one answered prompt for each class to the input prompt as demonstrations. More precisely, let Y = {+, −} be the set of labels for the binary causality identification task, let t ← v denote the template t in which the [MASK] token has been filled with word v, and let w_y = word(y) be the label word of class y ∈ Y. We then redefine f_t(x) in Equation 1 as f′_t(x):

f′_t(x) = f_t(x) ∥ (t ← w_+)(x_+) ∥ (t ← w_−)(x_−)

where ∥ is the string concatenation operator and x_y is an instance of class y randomly sampled from the training set. Figure 1 shows three different input prompts obtained by applying f′_t(x) three times to the same input instance x.

Classification process: the process is illustrated in Figure 1. First, the input instance x is converted into d different input prompts by applying f′_t(x) d times. Then, each input prompt is given to the LM to obtain d logit vectors holding the word scores for the mask in each prompt. A simple ensemble scheme is then applied by averaging all d logit vectors; the label word with the highest score is selected and finally mapped to its corresponding class y using the mapping word(y).

Training and model selection: to develop our prompt-based models, we performed a simplified version of the process described in previous work by Gao et al. (2020), consisting of the following six steps.

Step 1: we created a new training set, τk, by extracting k instances per class from the original train partition, and used the remaining 2925 − 2×k instances as a large evaluation set, δT−k (dataset statistics are given in Table 2).
Step 2: in order to add demonstrations to a given input x (see Equation 1), we uniformly sampled x− and x+ from the top-50% most similar instances in τk. To do so, we pre-computed the sentence embeddings of the training instances using a pre-trained SBERT (Reimers and Gurevych, 2019) model, using cosine similarity as the similarity metric.
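The demonstration sampling of Step 2 can be sketched as follows. This is a stand-alone illustration with toy vectors standing in for the pre-computed SBERT embeddings; the function names are ours:

```python
import math
import random

def cosine_sim(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sample_demo(query_emb, candidates, rng=random):
    """Uniformly sample one demonstration from the top-50% most similar
    candidates; `candidates` is a list of (text, embedding) pairs."""
    ranked = sorted(candidates,
                    key=lambda c: cosine_sim(query_emb, c[1]),
                    reverse=True)
    top_half = ranked[: max(1, len(ranked) // 2)]
    return rng.choice(top_half)[0]

# Toy example: the "close" candidate is nearly parallel to the query,
# so it makes the top half and is the only possible sample here.
cands = [("close", [1.0, 0.1]), ("far", [-1.0, 0.0])]
assert sample_demo([1.0, 0.0], cands) == "close"
```

In the paper's setup this sampling is done once per class (one x+ and one x−) so that every input prompt carries one answered demonstration per label.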
Step 3: using "causal" and "random" as label words, the next step was to automatically generate candidate templates using T5. First, each training instance x of class y in τk was converted to "[x]<P>word(y)<S>", where <P> and <S> are T5 mask tokens; a beam search of width 100 was then used to decode multiple template candidates by filling the <P> and <S> tokens.
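The input construction for this template search is plain string assembly. A sketch, keeping the paper's <P>/<S> notation literally (in an actual T5 setup these would be sentinel tokens such as <extra_id_0>; the function name is ours):

```python
# word(y) mapping used for template generation, as described above.
WORD = {"+": "causal", "-": "random"}

def t5_template_input(x: str, y: str) -> str:
    """Build the string "[x]<P>word(y)<S>" from which the template
    slots <P> and <S> are later decoded by beam search with T5."""
    return f"{x}<P>{WORD[y]}<S>"

inp = t5_template_input("Soldiers were hurt in the attacks", "+")
assert inp == "Soldiers were hurt in the attacks<P>causal<S>"
```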
Step 4: next, all 100 final candidate templates were sorted by F1 score. However, since this is a time-consuming step, a subset of the evaluation set was used, sampling 256 unique positive and negative instances from δT−k. Note that no fine-tuning is performed at this point; only the out-of-the-box pre-trained LM is used.
Step 5: we selected the top-10 best-performing templates as final candidates. For each candidate template, we fine-tuned the LM as an MLM task (see Equation 1) on the training set τk, evaluating it on the complete evaluation set δT−k.
Step 6: finally, the model with the best F1 score on the official dev set was selected as a candidate for submission; we also checked that its F1 score on δT−k was among the best (if not the best). Note that in this step we are evaluating the model on unseen data, since the official dev set is used as an unofficial test set.
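With a model selected, prediction follows the classification process described at the start of this section: build d prompts, average their logit vectors over the label words, and map the winning word back to a class. A minimal sketch with stand-ins for the LM calls (all names are ours):

```python
def classify(x, build_prompt, score_mask, label_words, d=3):
    """Average d logit vectors over label words and return the class.

    build_prompt(x) returns one (randomly demonstrated) prompt f'_t(x);
    score_mask(prompt) returns {label_word: logit} at the [MASK] position.
    Both are stand-ins for the RoBERTa model used in the paper.
    """
    totals = {w: 0.0 for w in label_words.values()}
    for _ in range(d):
        logits = score_mask(build_prompt(x))
        for w in totals:
            totals[w] += logits[w]
    avg = {w: s / d for w, s in totals.items()}
    best_word = max(avg, key=avg.get)
    # map the winning label word back to its class
    return next(y for y, w in label_words.items() if w == best_word)

# Toy stand-in scores: "causal" wins on average across the 3 prompts
# (mean 1.167 vs 0.4), even though prompt 2 alone prefers "random".
fake_scores = iter([{"causal": 1.2, "random": 0.1},
                    {"causal": 0.8, "random": 0.9},
                    {"causal": 1.5, "random": 0.2}])
pred = classify("Soldiers were hurt in the attacks",
                build_prompt=lambda x: x,
                score_mask=lambda p: next(fake_scores),
                label_words={"positive": "causal", "negative": "random"},
                d=3)
assert pred == "positive"
```

Averaging logits before the argmax is what makes this a simple ensemble over the d randomly demonstrated prompts.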
The above process was repeated varying the number k of training instances, with k = 256, 356, 512, and 1000 (we started from k = 256, inspired by evidence of performance saturation at this value compared to standard fine-tuning on the entire dataset; see Figure 3 in Gao et al. (2020)); the number d of input prompts to ensemble during the classification stage, with d from 1 to 9; and using RoBERTa (large and base) and DeBERTa V3 (base) as pre-trained LMs. In Step 5, models were fine-tuned for a maximum of 1000 steps using the AdamW (Loshchilov and Hutter, 2019) optimizer (β1 = 0.9, β2 = 0.999, ϵ = 1e−8) with a learning rate of γ = 1e−5 and no weight decay (λ = 0). Models were evaluated every 100 steps and check-pointed whenever a new best F1 score was obtained.

Table 2: Number of positive (causal) and negative (non-causal) instances in the train, dev, and test sets of the shared task. We refer the interested reader to (Tan et al., 2022a) for more details about the data and the labeling process.

Results & Discussion
In this section we provide details of the dataset, a set of additional experiments based on recent ensemble techniques, and the final configuration of our runs submitted to subtask 1 of CASE-2022.

Dataset
As mentioned earlier, the main goal of subtask 1 of CASE-2022 is to classify whether or not a given sentence contains a cause-effect relation. Thus, systems have to predict a Causal or Non-causal label for each sentence. Table 2 gives statistics on the class distribution across the train, dev, and test partitions.

Ensemble-based Approach
We also built several ensembles of different fine-tuned LMs to increase generalization and compensate for the overfitting of individual models. We followed the approach described in Fajcik et al. (2019), called TOP-N fusion. In this formulation, we first define a set of M pre-trained LMs fine-tuned with varying training seeds. TOP-N fusion starts by choosing one uniformly random model from the set, which is added to the ensemble. Next, it randomly shuffles the rest of the models and tries adding each of them to the ensemble once; a model stays in the ensemble if and only if it improves the overall F1 score. This iteratively optimizes the ensemble's F1 score, computed by averaging the models' output probabilities. As the selection process is stochastic, we repeat it N = 10000 times, constructing a new ensemble in each iteration independently of the previous ones. Finally, we select the best-performing ensemble for submission. Further details are given in Appendix B (Figure 2).
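The selection loop above can be sketched as follows. This is our own minimal reading of the procedure, not the released code; `f1_of` stands in for evaluating an averaged-probability ensemble on the dev set:

```python
import random

def top_n_fusion(models, f1_of, n_iters=10000, rng=random):
    """Stochastic greedy ensemble search in the style of TOP-N fusion.

    models: candidate fine-tuned models; f1_of(ensemble) returns the
    dev-set F1 of the ensemble (in the paper, obtained by averaging
    the ensemble members' output probabilities).
    """
    best_ens, best_f1 = None, -1.0
    for _ in range(n_iters):
        pool = list(models)
        rng.shuffle(pool)
        ensemble = [pool.pop(0)]        # start from one random model
        score = f1_of(ensemble)
        for m in pool:                  # try each remaining model once
            trial = f1_of(ensemble + [m])
            if trial > score:           # keep it only if F1 improves
                ensemble.append(m)
                score = trial
        if score > best_f1:             # track the best ensemble so far
            best_ens, best_f1 = ensemble, score
    return best_ens, best_f1

# Toy demo: models are represented by their standalone scores and the
# "ensemble F1" is just the best member, so the search must reach 0.9.
ens, f1 = top_n_fusion([0.3, 0.6, 0.9], f1_of=lambda e: max(e), n_iters=20)
assert f1 == 0.9
```

Because each of the N restarts is independent, the loop is trivially parallelizable, and only the best-scoring ensemble is kept for submission.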

Official Submissions
Next, we describe each of our submissions:

Ensemble-10m: the ensemble model described in subsection 4.2, with 10 final models obtained from a set of 150 initial ones (50 fine-tuned bert-base-cased, roberta-base, and deberta-v3-base models each).

Prompt-256: a prompt-based roberta-large model with k=256 training instances per class, d=3 input prompts to ensemble during the classification stage, and template t = "[x] This is not [MASK]".

Prompt-1000: the same model as above but with t = "[x] There were no [MASK]ities in this", k=1000, and d=1.

Ensemble-8p: the ensemble model described in subsection 4.2, with 8 final models obtained from the top-50 best-performing prompt-based models as the initial set.

Prompt-356e: three prompt-based models trained with k=356 instances. The first two use the same template as Prompt-1000 but with d=2 and d=3, respectively; the third uses the template t = "[x] The incident is not [MASK]" with d=1. Finally, a simple majority vote among these three models generates the output.

Table 1 shows the official results on both the dev and test partitions for our five submissions. As expected, the ensemble of several LMs (Ensemble-10m) obtained outstanding performance across several metrics during the validation phase (i.e., on the dev partition). However, its performance dropped significantly on the test partition (F1 = 89.44 → 83.70). On the contrary, our prompt-based approach trained on 256 instances per class (Prompt-256) generalized better on the test partition, obtaining 2nd place in precision (82.80%), 3rd in accuracy (82.64%), and 5th in F1 (85.08%); the best F1 was 86.19%. The main advantage of our approach, however, is that it allows the LM to be trained in a few-shot setting, making it harder for the model to overfit the data. Moreover, most of the available data can instead be kept for measuring the generalization power of the model.
For instance, our best-performing model (Prompt-256) was fine-tuned on only 15.7% of all the available data, allowing the remaining 84.3% to be used for evaluation and model selection (74.3% as an evaluation set and 10% as our own test set). The model selection is therefore more robust, since the risk of a performance drop on unseen data, such as the official test set, is expected to be lower.

Conclusions
This paper describes our participation in CASE-2022 subtask 1. Our proposed approach uses a few-shot configuration in which a prompt-based model is fine-tuned using only 256 instances per class, and yet it obtained remarkable results among all 16 participating teams. The comparison against traditional fine-tuning techniques, ensemble approaches, and the other participating models shows the potential of the proposed approach for generalizing better on the posed task. For future work, we plan to perform further ablation studies once we have access to the test set ground-truth labels, for instance, measuring the dev-to-test performance drop in relation to k, or the robustness to different training and demonstration sampling for a fixed k.