Strong and Light Baseline Models for Fact-Checking Joint Inference

How to combine several pieces of evidence to verify a claim is an interesting semantic task. Rather complex methods have been proposed, which combine different evidence vectors through an evidence interaction graph. In this paper, we show that, in the case of inference based on transformer models, two effective approaches are (i) a simple application of max pooling over the Transformer evidence vectors, or (ii) computing a weighted sum of the evidence vectors. Our experiments on the FEVER claim verification task show that these methods achieve the state of the art, constituting strong baselines for much more computationally complex models.


Introduction
Automatic Fact Checking is quickly gaining the attention of the NLP and AI communities. The FEVER.ai Fact Extraction and Verification Shared Task (Thorne et al., 2018) provides a benchmark for evaluating fact-checking systems. In FEVER, given a claim, C, and a collection of approximately five million Wikipedia pages, W, the task is to predict whether C is supported (SUP) or refuted (REF) by W, or whether there is not enough information (NEI) in W to support or refute C. If C is classified as SUP or REF, the respective evidence should be provided. Tab. 1 shows a FEVER claim and the gold-standard evidence refuting it.
The overall task is complex, as one needs to retrieve the documents that contain the evidence (document retrieval, DocIR), select the relevant evidence (evidence selection, ES), and label the claim given the evidence (evidence reasoning, ER), which is the focus of our work. Formally, given a claim, C, and a list of top K evidence sentences, (E_1, ..., E_K), retrieved with DocIR and selected by ES, respectively, the ER task is to classify C as SUP, REF, or NEI.

Table 1: A FEVER claim and the gold-standard evidence refuting it.
Claim: Coeliac disease is not treated by maintaining a gluten-free diet. (REF)
Evidence: [(Coeliac disease, "The only known effective treatment is a strict lifelong gluten-free diet....")]

There can be multiple inter-dependent evidence sentences per claim; modeling them jointly allows a system to take several clues into account at once, which intuitively improves accuracy. Indeed, individual sentences may not constitute standalone evidence, but they can contain several clues which, together, support or refute the claim. For example, Sentence 8 of the Gluten-free diet Wikipedia page, "..gluten-free diet is demonstrated as an effective treatment, but several studies show that about 79 % ... an incomplete recovery of the small bowel...", which is not listed as ground-truth evidence for the claim, still supports the REF signal.
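For concreteness, the instance in Tab. 1 can be rendered roughly as the following simplified, illustrative structure (this is not the exact schema of the official FEVER release, which stores evidence as annotated page/sentence-id references):

```python
# Simplified, illustrative rendering of the FEVER instance from Tab. 1.
instance = {
    "claim": "Coeliac disease is not treated by maintaining a gluten-free diet.",
    "label": "REFUTES",  # SUP / REF / NEI in the paper's shorthand
    "evidence": [
        ("Coeliac disease",  # Wikipedia page title
         "The only known effective treatment is a strict lifelong gluten-free diet...."),
    ],
}
```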
Given the above intuition, recent state-of-the-art (SOTA) approaches (Zhou et al., 2019; Ye et al., 2020; Zhong et al., 2020; Zhao et al., 2020) combine different pieces of evidence with graph networks, which also increases computational and space complexity. In this paper, we show that simple joint transformer-based methods achieve better performance than the best complex systems. Specifically, we (i) text-concatenate the evidence sentences, (ii) apply max pooling to their individual embedding representations, or (iii) compute their weighted sum. As of June 1st, 2021, our baseline is sixth in terms of Label Accuracy (LA) and seventh in terms of FEVER score on the official task leaderboard, where the absolute difference from the fourth-best LA is 0.2%.
We believe our results are important to help researchers select the right scientific challenges by providing the appropriate baselines. For example, proposing complex models that are less accurate than our baselines may mislead the research community; knowing these baselines can thus help steer research in this area in the right direction. Additionally, our baselines are strong, simple to use, and easily reproducible, enabling fast comparison with innovative inference models.

2 Related work
SOTA approaches. Most recent approaches encode claim and evidence texts using Transformer-based language representation models (LRM), such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and others. GEAR (Zhou et al., 2019) and KGAT (Liu et al., 2020) combine the resulting evidence representations with graph attention networks. The best-performing system with a published description, DOMLIN++ (Stammbach and Ash, 2020), simply text-concatenates evidence pieces and uses a RoBERTa-based classifier, thus supporting our thesis that simple models can be very effective. On the other hand, they use additional DocIR components and data (the MultiNLI corpus (Williams et al., 2018)) for fine-tuning. To the best of our knowledge, their code/output are not available online yet, so we cannot compare to them directly at the moment.
Baselines. The baselines in the above works, apart from the previous SOTA systems, consist of applying a transformer-based classifier to (i) the concatenation of C and all E_i, i = 1..K (Zhou et al., 2019; Zhong et al., 2020; Zhao et al., 2020); or (ii) separate (C, E_i) pairs, i = 1..K, whose results are then aggregated heuristically (Zhou et al., 2019). The latter work also considered max-pooling and weighted-sum baselines, but used them only on subsets of the development set with multiple gold evidence pieces per claim. In this work, we use them in the full-scale setting.

3 Strong baseline models
BERT for classification. The BERT LRM and its version with an improved training procedure, RoBERTa, have obtained outstanding results on a number of NLP tasks. When using BERT-based architectures for classification, a special [CLS] token is prepended to the input text sequence. Its embedding from the last layer of the transformer is fed into a classification layer, which outputs a vector of logits, L ∈ R^N, where N is the number of classes.
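As an illustration, the following minimal sketch (assuming the HuggingFace transformers and PyTorch APIs, with roberta-base as the LRM) shows how a claim-evidence pair is mapped to N = 3 class logits through the first-token embedding; in RoBERTa the role of [CLS] is played by the <s> token, and the evidence string prepends the page name with the ". " delimiter as in our input format.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
head = nn.Linear(encoder.config.hidden_size, 3)  # N = 3 classes: SUP / REF / NEI

claim = "Coeliac disease is not treated by maintaining a gluten-free diet."
# page name + ". " delimiter + evidence sentence
evidence = "Coeliac disease . The only known effective treatment is a strict lifelong gluten-free diet ."

enc = tokenizer(claim, evidence, return_tensors="pt", truncation=True)
cls_emb = encoder(**enc).last_hidden_state[:, 0, :]  # last-layer embedding of [CLS] (<s> in RoBERTa)
logits = head(cls_emb)                               # L in R^N, one logit per class
```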
Baseline approaches. We investigate four simple Transformer-based baseline approaches: Local, Concat, MaxPool, and WgtSum. The input to the task consists of a claim, C, and the list of top K evidence sentences selected by an ES component, E = {E_i}, i = 1, .., K. Tab. 2 describes the input format: [SEP] and [CLS] are the standard "separator" and "classification" tokens used in BERT-like models, while <psep> is the delimiter separating the page name from the evidence text (we use ". "). Following previous work, we incorporate the E_i source page name into the input. We use cross-entropy loss to train all the models.
Local: for each E_i, we (i) use the standard 3-way classification Transformer-based model, T_class, to get an evidence-level label prediction, P_i, along with its corresponding score l_i = max(L_i), where L_i ∈ R^N is the logits vector produced by T_class for E_i; (ii) sort the prediction list, P = [(P_i, l_i)], by l in decreasing order; (iii) create P', a sublist of P, keeping only entries where P_i is not NEI and l_i > 0. If P' is not empty, P'_1 is the claim label, otherwise the label is NEI. We introduce P' because we want to capture the SUP/REF signal even if it is weaker than that of NEI.
Concat: T_class is run on the concatenation of C and all the E_i, using the input format described in Tab. 2.
MaxPool: encodes each (C, E_i) pair with a transformer and applies (element-wise) max pooling over the resulting evidence vectors, feeding the pooled vector to the classification layer. It is inspired by the max pooling evidence aggregation procedure employed by Hanselowski et al. (2018) and Zhou et al. (2019).
WgtSum: encodes each (C, E_i) pair with a transformer and computes the weighted sum of the resulting evidence vectors, h = Σ_i w_i h_i, where the weights w_i are learned jointly with the rest of the model; h is then fed to the classification layer. WgtSum is similar to Zhou et al. (2019)'s attention baseline in the sense that we aggregate evidence representations via a weighted summation. However, differently from us, they obtain the weights by computing attention between the claim and the evidence hidden states.
We refer to Concat, MaxPool and WgtSum as global systems.
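The sketch below shows one plausible realization of the MaxPool and WgtSum aggregation and of the Local label-aggregation heuristic. It is a minimal sketch, not the released implementation: the class and function names are ours, and the scoring layer that produces the WgtSum weights is an assumption (the text only states that the weights are learned, not how).

```python
import torch
import torch.nn as nn
from transformers import AutoModel

NEI = "NOT ENOUGH INFO"

class GlobalEvidenceClassifier(nn.Module):
    """MaxPool / WgtSum global baselines: each (C, E_i) pair is encoded
    independently, the K [CLS] vectors are aggregated, and the aggregate
    is classified into SUP / REF / NEI."""

    def __init__(self, lrm="roberta-base", mode="maxpool", num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(lrm)
        hidden = self.encoder.config.hidden_size
        self.mode = mode
        # Scoring layer producing the WgtSum weights; its exact form is an
        # assumption -- the paper only says the weights are learned.
        self.scorer = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids / attention_mask: (K, seq_len), one row per (C, E_i) pair
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0, :]            # (K, hidden) [CLS] vectors
        if self.mode == "maxpool":
            agg = h.max(dim=0).values                 # element-wise max over evidence
        else:                                         # "wgtsum"
            w = torch.softmax(self.scorer(h), dim=0)  # (K, 1) evidence weights
            agg = (w * h).sum(dim=0)                  # h = sum_i w_i * h_i
        return self.classifier(agg)                   # logits over SUP / REF / NEI


def aggregate_local(predictions):
    """Local label aggregation (the P' heuristic): predictions is a list of
    (label, max_logit) pairs, one per evidence sentence. Prefer the strongest
    non-NEI prediction with a positive logit; otherwise fall back to NEI."""
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    non_nei = [(lab, score) for lab, score in ranked if lab != NEI and score > 0]
    return non_nei[0][0] if non_nei else NEI
```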
Our code is available at https://github.com/iKernels/reasoning-baselines. We use the pre-trained BERT and RoBERTa LRMs from the transformers library (https://github.com/huggingface/transformers), namely bert-base-cased, roberta-base and roberta-large.
Training setup. We train for three epochs, with an evaluation checkpoint every 500 and 2,500 training steps for the global and local models, respectively, thus having 14 checkpoints in total. We use K = 5 evidence pieces per claim. For all the models, the batch size/number of gradient accumulation steps are 8/8 and 2/32 with the base and large LRMs, respectively. We use the Adam optimizer with a slanted triangular learning rate schedule (Howard and Ruder, 2018), with cut_frac = 0.1 and ratio = 32 (the standard values suggested by Howard and Ruder (2018)).
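For reference, a minimal sketch of the slanted triangular schedule with these hyper-parameters, written as a PyTorch LambdaLR multiplier; the function name and the step handling are ours, while the formula follows Howard and Ruder (2018).

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def slanted_triangular(optimizer, total_steps, cut_frac=0.1, ratio=32):
    """Linear warm-up for the first cut_frac of training, then linear decay;
    the learning rate stays at or above 1/ratio of its peak value."""
    cut = max(1, math.floor(total_steps * cut_frac))

    def factor(step):
        if step < cut:
            p = step / cut
        else:
            p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
        return (1 + p * (ratio - 1)) / ratio

    return LambdaLR(optimizer, lr_lambda=factor)
```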
DocIR/ES pipeline. Our DocIR module follows Hanselowski et al. (2018), and the ES module selects the relevant evidence via a BERT-based system trained with a pairwise loss. Following previous work, when training and selecting the best checkpoint we use the gold evidence completed with the non-gold evidence pieces retrieved by ES, so that the total number of evidence pieces per claim is K. When evaluating on DEV we simply use the top 5 evidence pieces retrieved by ES, i.e., the results in Sec. 4.2 are obtained with automatically retrieved evidence.
Metrics. We report Label Accuracy (LA) and the FEVER score, computed with the official scorer (https://github.com/sheffieldnlp/fever-scorer). In addition to the correct label, the FEVER score requires that at least one of the top 5 predicted evidences is correct; an evidence consists of one or more sentences, and one claim can have multiple evidences.
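A minimal sketch of the evidence-completion step used at training time, under the assumption that non-gold pieces are simply appended in ES ranking order until K pieces are reached (the function name is ours):

```python
def build_training_evidence(gold, es_ranked, k=5):
    """Complete the gold evidence with non-gold pieces retrieved by ES so
    that every claim has exactly k evidence pieces (assumed fill-up order:
    ES ranking, skipping duplicates)."""
    evidence = list(gold)[:k]
    for sent in es_ranked:
        if len(evidence) >= k:
            break
        if sent not in evidence:
            evidence.append(sent)
    return evidence
```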

Results
Tab. 4 reports the performance of the systems described in Sec. 3 on the official DEV set.
In previous work, systems employing the KGAT ER architecture (Ye et al., 2020) achieve top performance in terms of FEVER score. The KGAT ER input data are publicly available, enabling a fair comparison. Lines 1 and 2 report KGAT performance as published. We integrated their implementation of the KGAT ER component into our pipeline and obtained performance comparable to the published numbers (lines 3 vs 1). Interestingly, our runs with roberta-base outperform the published results of KGAT runs with roberta-large (lines 4, 6, 9). We also include the best KGAT result with roberta-base (an upper bound for KGAT), which we obtained with a learning rate of 3e-5. KGAT with roberta-large and a learning rate of 2e-5 further pushes the performance one point up, while training with learning rates of 3e-5 and 5e-5 did not converge.
Local models. Lines 10-13 report the performance of our local models with two evidence label aggregation heuristics. Heuristic 1 applies the procedure described in Sec. 3 to the labels assigned to all evidence pieces by Local. Heuristic 2 simply picks the label assigned to the evidence sentence top-ranked by ES, as in Zhou et al. (2019). Aggregation heuristic 1 is more competitive.
Global models. Lines 14-22 report the performance of the Concat, WgtSum and MaxPool global systems, which all clearly outperform Local. Note that in the Concat setting, C and the E_i, i = 1, .., K, are concatenated, so the model is sensitive to the relative E_i order. Overall, the three models perform comparably to each other and to KGAT (lines 14-22 vs 5-7). MaxPool and WgtSum marginally outperform Concat with roberta-large.
We also trained Concat with roberta-large setting K = 1 both for training and prediction, i.e., using only the top evidence piece retrieved by ES. The resulting LA of 79.57 is only about one point behind that of Concat (line 16) and KGAT (line 7). This suggests that good performance can be obtained on the FEVER dataset even without joint reasoning over multiple E_i, and that there is still room for improvement for systems able to reason over multiple evidence pieces. This could also be partially attributed to the observation by Schuster et al. (2019), who showed that FEVER claims contain certain linguistic biases and that a BERT model fine-tuned on the claim texts alone significantly outperforms the majority baseline. Schuster et al. (2019) proposed a debiased symmetric test set, but its instances are claim-evidence pairs. This means that K = 1, and thus we did not evaluate our baselines on it, as with K = 1 they all become equivalent to Local.
Comparison to the state of the art. Tab. 5 compares the performance of MaxPool and WgtSum to that of the SOTA systems as of June 1st, 2021. Our simple baselines outperform all the other systems on DEV, but we may have overfitted on it, as we report the performance of the best checkpoint. On the blind TEST data, WgtSum with roberta-large scores seventh in terms of FEVER score and sixth in terms of LA on the official Codalab leaderboard.
Despite our best efforts, we were not able to track down the publications related to leaderboard submissions #1-#6. We do not know whether their superior performance is due to a better ER approach, a stronger LRM with billions of parameters, or a better DocIR/ES pipeline. In the latter two cases, the baselines in this work still remain relevant.
The best-performing system with a published description, DOMLIN++, uses roberta-large and the Concat approach. We cannot compare the results of our ER model to theirs directly, since they use a different ES system, which might have better evidence recall. Note that we still marginally outperform them in terms of LA. This may indicate that even though our gold evidence recall may be lower due to a possibly less powerful DocIR/ES pipeline (resulting in a lower FEVER score), we are still able to predict the correct label given the evidence sentences at our disposal. Moreover, they do additional pre-training on MultiNLI, while we do not exploit any external corpora.
Qualitative analysis. We compared the outputs of the Concat, MaxPool, WgtSum and KGAT systems. We analyzed 50 DEV set examples where only one of the four systems produced the correct label, aiming to understand the reason behind the correct prediction, but we did not observe any pattern explaining why one system outperforms the others. The systems seem to be equivalent in their abilities.
When analyzing the WgtSum output, we observed that, when summing the weighted distributed representations of the evidence pieces retrieved by the KGAT ES for a given claim (see Sec. 3), the model tends to assign higher weights to the evidence pieces that are correct according to the gold standard.

Conclusion
We have proposed strong, lightweight baselines for the FEVER fact-checking task and shown that they can outperform heavier models on the official leaderboard with a blind TEST set. In future work, we plan to capitalize on these results to build systems that can effectively trade off efficiency for accuracy.
A Learning rate selection.
When experimenting with roberta-base, we tried learning rates (lr) in the range [1e-5, 5e-5] with a step of 1e-5. Table 6 summarizes the results. The results obtained with learning rates in the range [2e-5, 5e-5] are very similar, so we used a learning rate of 2e-5 in most of the experiments in this paper.