Exploring Listwise Evidence Reasoning with T5 for Fact Verification

This work explores a framework for fact verification that leverages pretrained sequence-to-sequence transformer models for sentence selection and label prediction, two key sub-tasks in fact verification. Most notably, improving on previous pointwise aggregation approaches to label prediction, we take advantage of T5 using a listwise approach coupled with data augmentation. With this enhancement, we observe that our label prediction stage is more robust to noise and capable of verifying complex claims by jointly reasoning over multiple pieces of evidence. Experimental results on the FEVER task show that our system attains a FEVER score of 75.87% on the blind test set. This puts our approach atop the competitive FEVER leaderboard at the time of our work, scoring higher than the second-place submission by almost two points in label accuracy and over one point in FEVER score.


Introduction
In recent years, the Internet has become an effective platform for creating and sharing content to large audiences. Unfortunately, there have been occurrences of bad actors taking advantage of this to propagate manipulative information for their benefit, often to the point of spreading misinformation. With the large amount of data being generated on the Internet each day, it is infeasible to manually verify it all, motivating recent research into automated fact verification.
In this work, we explore a fact verification framework built with the pretrained sequence-to-sequence transformer T5 (Raffel et al., 2020) as its backbone, which we call LisT5. Within a standard three-stage architecture, we focus mostly on the label prediction problem. We adopt a "listwise approach", where all candidate sentences that form the evidence set of a claim are considered together.
Our main contribution is a data augmentation technique that deliberately introduces noise into the training data to combat data sparsity and produce a more robust model. At the time of its introduction, a full pipeline using our techniques represented the state of the art, achieving the top-scoring run on the FEVER leaderboard. An additional minor contribution exploits named entities during the sentence selection stage, which has a small but noticeable effect on generating a better candidate set for downstream label prediction. We believe these techniques can be valuable to a broader range of NLP tasks that also involve aggregating information from upstream retrieval models.

Background and Related Work
As this work focuses on the Fact Extraction and VERification (FEVER) task (Thorne et al., 2018), we begin by briefly describing the task setup. We are given a textual claim $q$, to be verified against a corpus comprising a subset of Wikipedia. Each claim is associated with a three-way veracity label $v(q) \in$ {SUPPORTS, NOINFO, REFUTES} and a set of reference sentences $S(q)$ that provide support. An example claim $q$, its label $v(q)$, and supporting evidence $S(q)$ are given in Figure 1.
The primary evaluation metric, FEVER score, is computed as the proportion of claims for which the system has predicted the correct veracity label conditioned on also having retrieved a complete set of reference sentences. Most current systems adopt a three-stage approach to this task, comprising document retrieval, sentence selection, and label prediction. In this work, our contributions focus on the second and third sub-tasks; for document retrieval, we simply augment current best practices with BM25 (Yang et al., 2017; Lin et al., 2021).
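To make the metric concrete, the following is a minimal sketch of the FEVER score computation; the dictionary keys are illustrative assumptions, and the official scorer additionally caps predicted evidence at five sentences, so this is not the reference implementation.

```python
def fever_score(predictions):
    """Sketch of the FEVER score: a claim counts only if its predicted label is
    correct and, for verifiable claims, at least one complete gold evidence set
    is covered by the predicted evidence sentences."""
    correct = 0
    for p in predictions:  # keys below are illustrative assumptions
        label_ok = p["predicted_label"] == p["gold_label"]
        if p["gold_label"] == "NOINFO":
            evidence_ok = True  # no evidence requirement for NOINFO claims
        else:
            evidence_ok = any(
                set(gold) <= set(p["predicted_evidence"])
                for gold in p["gold_evidence_sets"]
            )
        if label_ok and evidence_ok:
            correct += 1
    return correct / len(predictions)
```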

Claim: The Rodney King riots took place in the most populous county in the USA.
Evidence 1 (wiki/Los Angeles Riots): The 1992 Los Angeles riots, also known as the Rodney King riots, were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992.
Evidence 2 (wiki/Los Angeles County): Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.
Label: SUPPORTS

Figure 1: An example claim and its corresponding evidence and label from the FEVER dataset.
By construction, the veracity of each claim is determined by the (candidate) supporting sentences, taken together. One simple and popular approach to fact extraction and verification is to consider the veracity of the claim with respect to each candidate independently (i.e., classification), and then aggregate the evidence (Hanselowski et al., 2018; Zhou et al., 2019; Soleimani et al., 2019; Liu et al., 2020; Pradeep et al., 2021b). For convenience, we refer to these as "pointwise approaches", borrowing from the learning to rank literature (Li, 2011).
As an alternative, researchers have proposed approaches that consider multiple candidates at once to jointly arrive at a veracity prediction (Thorne et al., 2018; Nie et al., 2019a; Zhou et al., 2019; Stammbach and Neumann, 2019; Pradeep et al., 2021a). For convenience, we refer to these as "listwise approaches", also borrowing from the learning to rank literature (Li, 2011). Such listwise approaches have also been used for information aggregation in other NLP tasks such as question answering (Wang et al., 2018; Nie et al., 2019b). At a high level, this strategy suffers from a number of challenges, including data sparsity and a high level of sensitivity to noisy inputs. Following this thread of work, we adopt the listwise approach and improve it by training with a data augmentation technique that deliberately introduces noise into the training data to produce a more robust model.

Methods
Our work adopts a three-stage pipeline comprising document retrieval, sentence selection, and label prediction, which we detail in this section.

Document Retrieval
Given a claim $q$, our first step is to retrieve the top $K$ most relevant documents $\hat{D}(q) = \{d_1, \ldots, d_K\}$. Since the corpus contains over 5.4M documents, this retrieval step narrows our search space. We leverage the Pyserini toolkit (Yang et al., 2017; Lin et al., 2021), which is based on the popular Lucene search engine, using the BM25 scoring function (Robertson and Zaragoza, 2009) to rank documents. Additional document retrieval details are described in Appendix A.3. We also incorporate document retrieval using the MediaWiki API, which has been shown in previous work to form a strong baseline (Hanselowski et al., 2018). We combine the results of the two methods by alternating through the two ranked lists of documents, skipping duplicates and keeping the top $K$ unique documents, as sketched below.
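A minimal sketch of this alternating merge, assuming each method returns a ranked list of document IDs; the function name is ours:

```python
from itertools import chain, zip_longest

def interleave_unique(bm25_docs, mediawiki_docs, k):
    """Alternate between the two ranked lists, skipping duplicates and
    keeping the top-k unique document IDs."""
    merged, seen = [], set()
    for doc in chain.from_iterable(zip_longest(bm25_docs, mediawiki_docs)):
        if doc is None or doc in seen:  # None pads the shorter list
            continue
        seen.add(doc)
        merged.append(doc)
        if len(merged) == k:
            break
    return merged
```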

Sentence Selection
Given a claim $q$ and retrieved documents $\hat{D}(q)$, the next stage in the pipeline selects the top $L$ most relevant evidence sentences $\hat{S}(q) = \{s_{k_1,i_1}, \ldots, s_{k_L,i_L}\}$, where $s_{k,i}$ is the $i$-th sentence from document $d_k$. Similar to how Soleimani et al. (2019) and Subramanian and Lee (2020) frame this stage as a semantic matching problem using BERT-based models, we use T5 to rank the similarities between the claim and the sentences in each document. Like Pradeep et al. (2021a), we use T5 (Raffel et al., 2020) as a pointwise reranker, in the configuration introduced by Nogueira et al. (2020) and dubbed monoT5. Empirically, T5 has been found to be more effective at ranking than BERT-based models across a wide variety of domains.
Since T5 is a sequence-to-sequence model, ranking is performed using the following input template:

Query: $q$ Document: $s_{k,i}$ Relevant:

where $q$ and $s_{k,i}$ are the claim and the evidence sentence, respectively. To provide broader context and to resolve ambiguities, we prepend each sentence $s_{k,i}$ with the title of document $d_k$.
We fine-tune the model to generate the token "true" if $s_{k,i} \in S(q)$ and "false" otherwise. As training data for fine-tuning, we use the gold evidence sentences in $S(q)$ as "true" samples; for the "false" samples, we sample negatives from the sentences in $\hat{D}(q)$.
At inference time, we construct a candidate set comprised of the sentences from each document in $\hat{D}(q)$, in retrieved order. Using the same input format, for each sentence we probe the logits of the "true" and "false" tokens and apply the softmax function to produce a relevance probability between 0 and 1; these scores are used to select the top $L$ (= 5) sentences. For efficiency, instead of reranking all sentences in $\hat{D}(q)$, we take the first 200 sentences and rerank only this subset. Since there is an average of five non-empty sentences per document, this roughly corresponds to considering the top 40 documents from $\hat{D}(q)$.
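The scoring step can be sketched with the Hugging Face transformers API as follows; the checkpoint name and helper function are illustrative assumptions, not our released code.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-3b")  # checkpoint is an assumption
model = T5ForConditionalGeneration.from_pretrained("t5-3b").eval()

@torch.no_grad()
def relevance_score(claim: str, sentence: str) -> float:
    inputs = tokenizer(f"Query: {claim} Document: {sentence} Relevant:",
                       return_tensors="pt")
    # One decoding step: T5's decoder starts from the decoder-start (pad) token.
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    true_id = tokenizer.encode("true")[0]    # first subword of "true"
    false_id = tokenizer.encode("false")[0]  # first subword of "false"
    # Softmax over just these two logits yields a probability in [0, 1].
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()
```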
On top of the basic reranking input template of Nogueira et al. (2020), we introduce a novel enhancement: we append any named entities found within the claim to the input of monoT5. The intuition is to prompt monoT5 to promote sentences from documents whose titles are similar to those entities, as such documents tend to contain information relevant to verifying the claims. During fine-tuning, we use the titles of the documents that contain the gold evidence as entities; during inference, we extract named entities from the claims using the named entity recognition (NER) module built into spaCy's en_core_web_sm model. We append these entities, denoted $e_1, \ldots, e_j$, to our monoT5 input template as follows:

Query: $q$ Document: $s_{k,i}$ Entity1: $e_1$ · · · Entityj: $e_j$ Relevant:

Additional details are described in Appendix A.3.
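A sketch of assembling this NER-augmented input at inference time; the exact separator used to prepend the document title is our assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # the spaCy model named above

def ner_input(claim: str, title: str, sentence: str) -> str:
    """Build the NER-augmented monoT5 input. Joining the document title and
    sentence with a period is an assumption for illustration."""
    entities = [ent.text for ent in nlp(claim).ents]
    parts = [f"Query: {claim}", f"Document: {title} . {sentence}"]
    parts += [f"Entity{j}: {e}" for j, e in enumerate(entities, start=1)]
    parts.append("Relevant:")
    return " ".join(parts)
```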

Label Prediction
Given claim $q$ and evidence $\hat{S}(q)$, the final stage of the pipeline predicts a veracity label $\hat{v}(q)$.
Pointwise Aggregation One common method in the literature for label prediction is to pair the claim with each evidence sentence individually as input to some model, then aggregate the per-sentence outputs to obtain a veracity prediction. Given the sequence-to-sequence nature of T5, we achieve this by fine-tuning the model on inputs that pair the claim $q$ with a single evidence sentence $s_{k,i}$ at a time. For fine-tuning, we use $S(q)$ as the evidence for SUPPORTS and REFUTES samples. As with sentence selection, for NOINFO samples we sample negatives from the top predicted sentences from the upstream stage, which in this case is sentence selection, using the full reranked candidate list rather than just the top $L$ sentences in $\hat{S}(q)$.
Listwise Concatenation Another common strategy for label prediction is to concatenate all $L$ sentences into a single input and have the model directly classify the claim and list of evidence $\hat{S}(q)$ as one of SUPPORTS, NOINFO, or REFUTES. Again with T5, we use the following input sequence:

query: $q$ sentence1: $s_{k_1,i_1}$ · · · sentenceL: $s_{k_L,i_L}$ relevant:

To obtain fine-tuning data, we use the same method as for pointwise aggregation.
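A one-function sketch of assembling this listwise input; the function name is ours.

```python
def listwise_input(claim: str, sentences: list) -> str:
    """Concatenate the claim and the top-L evidence sentences into the
    listwise label prediction template."""
    parts = [f"query: {claim}"]
    parts += [f"sentence{j}: {s}" for j, s in enumerate(sentences, start=1)]
    parts.append("relevant:")
    return " ".join(parts)
```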
Listwise Data Augmentation To make label prediction more tolerant of noisy evidence in the top $L$ sentences, we fine-tune T5 with augmented, noisy evidence sets: this more closely mimics what the model sees at inference, since $\hat{S}(q)$ usually contains some non-gold evidence. To accomplish this, instead of fine-tuning directly with the gold evidence sets $S(q)$, we fine-tune using $I(S(q))$, which "infuses" $S(q)$ with $\hat{S}(q)$. Specifically, we define the transformation $I$ as:

• If $v(q) \in$ {SUPPORTS, REFUTES}: for each $s \in S(q)$ such that $s \notin \hat{S}(q)$, we randomly select an index $k$ of $\hat{S}(q)$ where $\hat{S}(q)[k] \notin S(q)$ and insert $s$ at $\hat{S}(q)[k]$. This is repeated for each such $s$, and $I(S(q))$ returns the resulting list of sentences $\hat{S}(q)$.
• If $v(q)$ = NOINFO, $I(S(q)) = \hat{S}(q)$.

Note that we use the same T5 input format as listwise concatenation. Training details for the label prediction stage can be found in Appendix A.3.
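A sketch of the infusion transform in Python; we read "insert $s$ at $\hat{S}(q)[k]$" as replacing the non-gold sentence at position $k$, which keeps the list length at $L$, though that interpretation is our assumption.

```python
import random

def infuse(gold, retrieved, label):
    """Sketch of the infusion transform I: each gold sentence missing from the
    retrieved list takes the position of a randomly chosen non-gold sentence."""
    if label == "NOINFO":
        return list(retrieved)  # NOINFO claims have no gold evidence
    infused = list(retrieved)
    for s in gold:
        if s in infused:
            continue
        non_gold = [k for k, cand in enumerate(infused) if cand not in gold]
        if not non_gold:
            break
        infused[random.choice(non_gold)] = s
    return infused
```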

Results
We report the overall results of LisT5 on the FEVER development and blind test sets in Table 1, comparing the label prediction variations presented in Section 3.3. We also include the oracle FEVER score for our retrieved $\hat{S}(q)$ on line (2a). For reference, we compare LisT5 against several baselines and state-of-the-art techniques (drawn from the leaderboard at the time of our work), shown in lines (1a)-(1h). From the results in Table 1, it is clear that the different label prediction strategies lead to vastly different FEVER scores. The top-performing method, according to both label accuracy and FEVER score, is trained with augmented data in a listwise manner, shown on line (2e). This run represented the state of the art atop the FEVER leaderboard at the time of our work. The other methods, which fine-tune with only gold evidence, shown on lines (2b) to (2d), trail by over 10 points. These results underscore the importance of training with augmented listwise evidence sets, as presented in Section 3.3.
Contrary to the results reported in some papers, our concatenation methods consistently outperform the corresponding aggregation methods: this suggests that T5 is able to capture inter-sentence semantics and use information from multiple, possibly diverse, pieces of evidence to arrive at veracity conclusions. Specifically, the T5 variant on line (2e) achieves 78.02% (174/223) label accuracy on development-set claims that require retrieving at least two pieces of evidence in conjunction to verify, which is close to our overall label accuracy of 81.26%. This finding suggests that T5 is capable of incorporating and corroborating the information contained in multiple pieces of evidence, one of the most commonly cited areas for improvement in previous papers.

Table 2 compares the LisT5 sentence selection results of the monoT5 variations described in Section 3.2. We include some baseline results, using recall at five as the primary sentence selection metric, which by definition is an upper bound on the downstream FEVER score. We format the LisT5 results as an ablation analysis focused on sentence selection. Line (2a) shows the results of the full monoT5 model with NER features, fine-tuned on the FEVER dataset; monoT5 without NER features but fine-tuned on the FEVER dataset is shown on line (2b). Finally, line (2c) shows zero-shot monoT5, i.e., monoT5 without fine-tuning on the FEVER dataset, taken directly from the model checkpoints of Nogueira et al. (2020).

Error Analysis
We randomly select 200 claims incorrectly predicted by LisT5 and summarize the most common issues, in the hope of identifying areas of improvement for future fact verification systems.
One common issue is failing to distinguish between similar but semantically different words or phrases. An example is the claim "Shane McMahon officially retired on the first day of 2010", for which our document retrieval and sentence selection stages retrieve the sentence "In 2009, McMahon announced his resignation from WWE which went into effect January 1, 2010". Here, retirement and resignation are semantically similar words that both describe individuals leaving their positions. The pretrained transformer may have learned this similarity, but the two words do not always imply one another, leading to an incorrect prediction for this claim.
Another frequent issue is incorrectly labelled claims in the FEVER dataset, often due to missing evidence in $S(q)$. An example is the claim "Mickey Rourke appeared in a sequel", for which our document retrieval and sentence selection stages retrieve the sentence "Since then, Rourke has appeared in several commercially successful films including the 2010 films Iron Man 2 and The Expendables and the 2011 film Immortals". However, the claim was labelled NOINFO in the dataset, which is incorrect since Iron Man 2 is indeed a sequel. In short, we are bumping into quality issues in the annotations themselves.

Conclusion
In this paper, we present the LisT5 framework for automated fact verification. LisT5 consists of a three-stage pipeline: document retrieval, sentence selection, and label prediction. For document retrieval, we combine two strong baselines. For sentence selection, we fine-tune a T5 model as a reranker with named entities provided as additional features. For label prediction, we present evidence in a listwise manner to a T5 model trained on augmented data. Our experimental results indicate that LisT5 achieves the state of the art on the FEVER task, which we attribute to the framework's ability to reason jointly over multiple pieces of evidence.

A.1 FEVER Dataset
The dataset used for training and evaluating our fact verification system is FEVER (Thorne et al., 2018), a large-scale dataset consisting of 185K claims with evidence taken from Wikipedia. We include the label distribution of the dataset across its training, development, and blind test set in Table 3.

A.2 Baseline Details
As discussed in Section 3, most fact verification systems, especially for the FEVER task, consist of a three-stage pipeline similar to the one used in LisT5: document retrieval, which identifies candidate articles from the corpus; sentence selection, which extracts candidate evidence sentences from those articles; and label prediction, which determines the claim's veracity given the selected evidence. However, there has also been active research into graph-based models for knowledge aggregation that model evidence sentences as nodes in a graph (Zhou et al., 2019; Liu et al., 2020; Zhong et al., 2020).

A.3 Implementation and Training Details
Document Retrieval We retrieve with BM25 using the parameters $k_1 = 0.6$ and $b = 0.5$. These parameters are tuned by running a grid search over parameter values in 0.1 increments over a subset of the training set.
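A sketch of running BM25 retrieval with these parameters via Pyserini's Lucene searcher; the index path and query are placeholders, not the names of our actual artifacts.

```python
from pyserini.search.lucene import LuceneSearcher

# Index path is an assumption for illustration.
searcher = LuceneSearcher("indexes/fever-wiki")
searcher.set_bm25(k1=0.6, b=0.5)  # the tuned parameters reported above

claim = "The Rodney King riots took place in the most populous county in the USA."
hits = searcher.search(claim, k=1000)
top_doc_ids = [hit.docid for hit in hits]
```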
Sentence Selection Whenever we fine-tune monoT5, we use the T5-3B variant, which, as its name suggests, contains three billion parameters. We fine-tune the model with batch size 128 over one epoch, using the configurations prescribed by Raffel et al. (2020), except that we use a learning rate of 0.0001 instead of 0.001. While training, we save checkpoints at evenly spaced intervals, usually around 1000 iterations per checkpoint depending on the size of the training data; whenever we report the results of a model, we use the best-performing checkpoint on the FEVER development set. We fine-tune on TPU v3-8 nodes on the Google Cloud Platform, which takes around 24 hours. Note that we first fine-tune the pretrained T5 model on the MS MARCO passage dataset (Bajaj et al., 2018) for 10000 iterations, following best practices reported in previous work (Akkalyoncu Yilmaz et al., 2019; Nogueira et al., 2020; Zhang et al., 2020; Pradeep et al., 2020, 2021c), which has shown that this leads to improved effectiveness. This procedure also gives us a zero-shot setting for fact verification, which we experiment with before fine-tuning on the FEVER dataset directly. In our experiments, we note that sampling negative sentences from highly ranked documents in $\hat{D}(q)$ leads to poorly performing models. This may be due to false negatives in the data, where some claims are labelled NOINFO but are actually verifiable, with relevant evidence retrieved by our document retrieval stage. To avoid sampling such false negatives, we draw negative sentences from those ranked between 50 and 200.
Label Prediction Again, we use the T5-3B variant as the model for label prediction. We use similar settings for fine-tuning as with monoT5, except that we use the default learning rate of 0.001. We also fine-tune on TPU v3-8 nodes on the Google Cloud Platform, which takes around 8 hours.
To avoid the negative sampling issues encountered when fine-tuning for sentence selection, we sample from sentences ranked between 10 and 25 here.

A.4 Document Retrieval Results
We demonstrate the importance of combining the two document retrieval methods described in Section 3.1 by comparing their recall at rank 1000 in Table 4. These figures show that combining the two techniques brings us within a few points of perfectly retrieving all relevant documents.