A DQN-based Approach to Finding Precise Evidences for Fact Verification

Computing precise evidences, namely minimal sets of sentences that support or refute a given claim, rather than larger evidences, is crucial in fact verification (FV), since larger evidences may contain conflicting pieces, some of which support the claim while others refute it, thereby misleading FV. Despite their importance, precise evidences are rarely studied by existing FV methods. Finding precise evidences is challenging because the search space is large and riddled with local optima. Inspired by the strong exploration ability of the Deep Q-learning Network (DQN), we propose a DQN-based approach to the retrieval of precise evidences. In addition, to tackle the label bias on Q-values computed by DQN, we design a post-processing strategy that seeks the best thresholds for determining the true labels of computed evidences. Experimental results confirm the effectiveness of DQN in computing precise evidences and demonstrate improvements in achieving accurate claim verification.


Introduction
With the growth of false information, such as fake news, political deception and online rumors, automatic fact-checking systems have emerged to identify and filter such information. Fact verification (FV) is a special fact-checking task that aims to retrieve related evidences from a text corpus to verify a textual claim.
Taking Figure 1 as an example, an existing method for FV first retrieves related documents from the given corpus at stage 1 (namely the document retrieval stage), then finds key sentences from the documents at stage 2 (namely the sentence selection stage), and finally treats the set of key sentences as an evidence to verify the claim at stage 3 (namely the claim verification stage).

Figure 1: The pipeline for FV on FEVER. Underlined words in blue italics in the evidence provide key information to determine the truthfulness of the claim. "SUPPORTS" / "REFUTES" / "NOT ENOUGH INFO" indicates that the evidence supports / refutes / is insufficient for supporting or refuting the claim. Both the evidence and the label are output by FV.

As can be seen in this example, it is desirable to retrieve an evidence consisting of the first two sentences only, since it does not contain unnecessary sentences for determining the truthfulness of the claim and can alleviate the human effort needed to further validate the evidence. More importantly, an evidence containing unnecessary sentences may involve conflicting pieces, some of which support the claim while others refute it. Thus, it is crucial to compute minimal sets of sentences that can determine the truthfulness of the claim. In this paper, we refer to a minimal set of sentences that supports or refutes a given claim as a precise evidence.
Existing methods for FV do not target the retrieval of precise evidences. Most existing studies (Thorne et al., 2018b; Nie et al., 2019; Zhou et al., 2019; Zhong et al., 2020; Ye et al., 2020; Subramanian and Lee, 2020) formulate FV as a three-stage pipeline task as illustrated in Figure 1. This formulation makes the retrieval of precise evidences extremely difficult, since the sentence selection stage would be required to select a precise set of relevant sentences rather than a fixed number of sentences as in existing methods. To the best of our knowledge, TwoWingOS (Yin and Roth, 2018) is by now the only method that does not follow the three-stage pipeline. Instead, it exploits a supervised training scheme to train the last two stages jointly and is able to compute precise evidences. However, it performs significantly worse than other state-of-the-art methods for FV, especially in terms of the recall of evidences. Therefore, there is still a need for new methods that compute precise evidences and that are expected to achieve better performance than TwoWingOS.
It is challenging to compute precise evidences. On one hand, the search space for precise evidences is very large. For example, in the benchmark Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018b), the average number of sentences per claim input to the sentence selection stage is 40, and an output evidence has up to 5 sentences. Hence there are up to $\sum_{i=1}^{5} \binom{40}{i} = 760{,}098$ candidates in the search space. On the other hand, a greedy search for precise evidences easily falls into a local optimum. As shown in our experiments (see Table 6), a greedy search method does not perform well.
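The size of this search space can be verified with a short computation; the snippet below is a minimal sketch counting all non-empty sentence subsets of size at most 5 drawn from 40 candidates.

```python
from math import comb

def candidate_count(n_sentences: int, max_size: int) -> int:
    """Number of non-empty sentence sets of size at most `max_size`
    drawn from `n_sentences` candidate sentences."""
    return sum(comb(n_sentences, i) for i in range(1, max_size + 1))

# With 40 candidate sentences and evidences of up to 5 sentences,
# the search space already holds 760,098 candidates.
print(candidate_count(40, 5))  # → 760098
```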
Inspired by the strong exploration ability of the Deep Q-learning Network (DQN) (Mnih et al., 2015), we develop a DQN-based approach to the retrieval of precise evidences. In this approach, we first employ DQN to compute candidate pairs of precise evidences and their labels, and then use a post-processing strategy to refine the candidate pairs. We notice that the Q-values computed by DQN have label bias, for two reasons. On one hand, the label "NOT ENOUGH INFO" is not at the same conceptual level as "SUPPORTS" or "REFUTES". On the other hand, Q-values have no fixed range, which makes them hard to estimate accurately. Thus, a post-processing strategy is needed to tackle the label bias on Q-values. We develop such a strategy to seek the best thresholds for determining the true labels of computed evidences.
Our experimental results on FEVER (Thorne et al., 2018b) confirm that our DQN-based approach is effective in finding precise evidences. More importantly, the approach is shown to outperform state-of-the-art methods for FV.

Fact Extraction and Claim Verification
The FEVER 1.0 shared task (Thorne et al., 2018b) aims to develop an automatic fact verification system that determines the truthfulness of a textual claim by extracting related evidences from Wikipedia. Thorne et al. (2018a) formalized this task, released the large-scale benchmark dataset FEVER (Thorne et al., 2018b), and designed the three-stage pipeline framework for FV, which consists of the document retrieval stage, the sentence selection stage and the claim verification stage. Most existing methods follow this framework and mainly focus on the last stage. For the document retrieval stage, most methods reuse the document retrieval component of top-performing systems (Hanselowski et al., 2018; Yoneda et al., 2018; Nie et al., 2019). For the sentence selection stage, three approaches are commonly used: keyword matching, supervised classification, and sentence similarity scoring (Thorne et al., 2018b). For the claim verification stage, most recent studies formulate the task as a graph reasoning task (Zhou et al., 2019; Ye et al., 2020; Zhong et al., 2020; Subramanian and Lee, 2020). Different from most existing methods that focus on claim verification, Yin and Roth (2018) proposed a supervised training method named TwoWingOS to jointly conduct sentence selection and claim verification.
Nowadays, pre-trained language models like BERT (Devlin et al., 2019) are widely used in claim verification (Zhou et al., 2019; Soleimani et al., 2020). Following this line, we employ RoBERTa, an enhanced version of BERT, as the sentence encoder in our DQN-based approach in our experiments.

Deep Q-learning Network
Reinforcement learning (RL) concerns an agent interacting with an environment, with the objective of maximizing the cumulative reward of a sequence of states and actions by adjusting its policy. Q-learning is a popular reinforcement learning technique that aims to approximate the optimal value function Q*(o, a), which measures the expected long-term reward for a given pair of state o and action a. The Deep Q-learning Network (DQN) (Mnih et al., 2015) combines deep learning with Q-learning. It typically approximates the optimal Q-value function with the following Equation (1), derived from the Bellman equation (Cao and ZhiMin, 2019):

Q*(o^(t), a^(t)) = r^(t) + λ max_a Q*(o^(t+1), a),    (1)

where o^(t), a^(t) and r^(t) respectively denote the state, action and reward at step t, and λ ∈ [0, 1] is a discount factor for future rewards.
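As a concrete illustration, the one-step Bellman target in Equation (1) can be sketched in plain Python; the list of next-state Q-values is a hypothetical stand-in for the network's outputs.

```python
def q_target(reward: float, next_q_values: list, lam: float = 0.95) -> float:
    """One-step Bellman target: r^(t) + λ · max_a Q(o^(t+1), a).
    `next_q_values` holds Q(o^(t+1), a) for every admissible action a;
    an empty list marks a terminal state (no future reward)."""
    return reward + (lam * max(next_q_values) if next_q_values else 0.0)

print(q_target(1.0, [0.2, 0.8, 0.5]))  # 1.0 + 0.95 * 0.8 = 1.76
```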

Problem Setting
Given a set of candidate sentences S = {s_1, s_2, . . . }, a claim c, a set of precise evidences E ⊆ 2^S, and a true label y ∈ Y = {T, F, N} that determines whether every precise evidence supports or refutes the claim, where T/F/N denotes "SUPPORTS"/"REFUTES"/"NOT ENOUGH INFO", we aim to train a model to predict a precise evidence; more precisely, to train a model for retrieving an evidence Ê ⊆ S and predicting a label ŷ ∈ Y such that ŷ = y and Ê = E for some E ∈ E. This goal differs from that of existing methods, which aim to retrieve an evidence Ê ⊆ S and predict a label ŷ ∈ Y such that ŷ = y and E ⊆ Ê for some E ∈ E.
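The difference between the two goals can be made concrete with a toy check; the sentence identifiers below are hypothetical.

```python
def is_precise_hit(predicted: set, gold_evidences: list) -> bool:
    """Our goal: the predicted evidence must EQUAL some gold precise evidence."""
    return any(predicted == gold for gold in gold_evidences)

def is_cover_hit(predicted: set, gold_evidences: list) -> bool:
    """The conventional goal: the prediction need only COVER some gold evidence."""
    return any(gold <= predicted for gold in gold_evidences)

gold = [{"s1", "s2"}]
print(is_cover_hit({"s1", "s2", "s3"}, gold))    # True: covers the gold evidence
print(is_precise_hit({"s1", "s2", "s3"}, gold))  # False: contains an extra sentence
print(is_precise_hit({"s1", "s2"}, gold))        # True: exactly the gold evidence
```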
We define the four ingredients of DQN, namely states, actions, transitions and rewards, as follows:
• State. A state o is a tuple (c, Ê, ŷ) for c a claim, Ê a set of sentences and ŷ a label.
• Action. An action a is a sentence in S.
where the number K is a hyper-parameter, and |S| denotes the cardinality of a set S.

The DQN-based Model
The core of our proposed approach is the DQN-based model, illustrated in Figure 2.

Sentence Encoding Module
We employ RoBERTa in this module and extract the final hidden state of <s> as the sentence representation, where <s> and </s> mentioned below are the special classification and separator tokens in RoBERTa. Specifically, following KGAT, we first concatenate the claim c, the document title l, and a sentence s (resp. an action a) as "<s> c </s> l </s> s </s>" (resp. "<s> c </s> l </s> a </s>") and then feed it into RoBERTa to obtain the sentence representation h_s ∈ R^{d_0} (resp. the action representation h_a ∈ R^{d_0}), where d_0 is the dimension of the representation. We also feed the claim "<s> c </s>" alone to obtain the claim representation h_c ∈ R^{d_0}.
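Concretely, the concatenation template can be sketched as plain string construction (the real pipeline feeds these strings through the RoBERTa tokenizer, which handles the special tokens itself; the function names here are ours):

```python
def build_input(claim: str, title: str, sentence: str) -> str:
    """Build the "<s> c </s> l </s> s </s>" template pairing a claim with a
    document title and a candidate sentence (or action)."""
    return f"<s> {claim} </s> {title} </s> {sentence} </s>"

def build_claim_input(claim: str) -> str:
    """The claim-only input "<s> c </s>" used for the claim representation."""
    return f"<s> {claim} </s>"

print(build_input("Claim text", "Doc_Title", "A candidate sentence."))
# <s> Claim text </s> Doc_Title </s> A candidate sentence. </s>
```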

Evidence Encoding Module
This module is used to get an aggregated evidence representation. It consists of two sub-modules.

Context sub-module. Since the sentences in an evidence are contextually dependent, we apply two different networks, BiLSTM (Nguyen et al., 2016) and Transformer (Vaswani et al., 2017), for comparison; both are widely used to encode context-aware information of sequential text in the NLP community. Either network maps the sentence representations to context-aware representations of the sentences in Ê, where d_1 is the dimension of the representation.

Aggregation sub-module. This sub-module fuses the sentence representations in evidences to obtain an aggregated evidence representation. We also apply two different networks in this sub-module: Transformer and attention. Unlike the Transformer with self-attention in the first sub-module, the query in this sub-module is the claim and the key/value is the context-aware sentence representation from the first sub-module. For the attention network, we score each context-aware sentence representation against the claim, normalize the scores into attention weights, and take the weighted sum as the aggregated evidence representation e ∈ R^{d_1}, where MLP(·) = Linear(ReLU(Linear(·))) is a two-layer fully connected network using the rectified linear unit as activation function, and [;] denotes the concatenation of two vectors.

Figure 2: The architecture of the DQN-based model. The input is a state and an action, and the output is the Q-value of each label. The sentence encoding module computes the sentence representations, the evidence encoding module computes the evidence representation, and the value module predicts the Q-value for each label.
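The claim-guided attention just described can be sketched in NumPy as follows; the random weight matrices stand in for the learned MLP parameters, and the layer sizes are assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a score vector."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def aggregate_attention(h_c, H, hidden=16, rng=None):
    """Score each context-aware sentence representation h in H by
    MLP([h_c ; h]) (Linear -> ReLU -> Linear), softmax the scores into
    attention weights, and return the weighted sum as the evidence
    representation e."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = h_c.shape[0]
    # random stand-ins for learned parameters
    W1 = rng.standard_normal((2 * d, hidden)) / np.sqrt(2 * d)
    W2 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
    scores = np.array([
        (np.maximum(np.concatenate([h_c, h]) @ W1, 0.0) @ W2).item()
        for h in H
    ])
    alpha = softmax(scores)   # attention weights over evidence sentences
    return alpha @ H          # aggregated evidence representation e

e = aggregate_attention(np.ones(8), np.random.default_rng(1).standard_normal((3, 8)))
print(e.shape)  # (8,)
```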

Value Module
This module is used to obtain the Q-value vector over all labels, written Q(o, a; θ) for θ the set of learnable parameters. It combines the evidence, claim and action representations through a learnable matrix W ∈ R^{d_0×d_0} and an MLP(·) = Linear(ReLU(Linear(·))) that is similar to the MLP(·) used in Equation (6) except that different parameters are used in its linear layers, yielding Q(o, a; θ) ∈ R^{d_2} for d_2 the number of distinct labels.
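A hypothetical sketch of such a value head is given below; the fusion of the representations and the layer sizes are our assumptions, not the paper's exact Equation (7).

```python
import numpy as np

def q_values(e, h_c, h_a, n_labels=3, rng=None):
    """Fuse the evidence representation e with a bilinear claim-action
    interaction h_c · W · h_a, then map the fused features to one Q-value
    per label with a two-layer MLP (Linear -> ReLU -> Linear).
    All weights are random stand-ins for learned parameters."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d0 = h_c.shape[0]
    W = rng.standard_normal((d0, d0)) / d0          # learnable matrix W
    features = np.concatenate([e, [h_c @ W @ h_a]])  # fused feature vector
    d = features.shape[0]
    W1 = rng.standard_normal((d, 16)) / np.sqrt(d)
    W2 = rng.standard_normal((16, n_labels)) / np.sqrt(16.0)
    return np.maximum(features @ W1, 0.0) @ W2       # Q(o, a; θ) ∈ R^{d_2}

q = q_values(np.ones(8), np.ones(8), np.ones(8))
print(q.shape)  # (3,)
```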

Objective Function
Given a transition (o^(t), a^(t), o^(t+1)) and its reward r^(t), we use the Double Deep Q-learning Network (DDQN) technique to train our model through the temporal-difference error (Mnih et al., 2015). This error δ is formally defined as

δ = r^(t) + λ Q̂_ŷ(o^(t+1), a*; θ̄) − Q_ŷ(o^(t), a^(t); θ),    for a* = argmax_{a ∈ S\Ê} Q_ŷ(o^(t+1), a; θ).

In the above equation, Q̂(·; θ̄) is the target network in DDQN, Q_ŷ denotes the Q-value of ŷ for ŷ the predicted label in o, Ê is the predicted evidence in o, and λ ∈ [0, 1] is a hyper-parameter representing the discount factor.
We use the Huber loss to minimise δ:

L(θ) = (1/|B|) Σ_{(o^(t), a^(t), o^(t+1), r^(t)) ∈ B} Huber(δ),

where B is a batch of transition-reward pairs.
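A minimal sketch of the Huber loss applied to a single temporal-difference error (in PyTorch this corresponds to the smooth L1 loss):

```python
def huber(delta: float, beta: float = 1.0) -> float:
    """Huber loss on the temporal-difference error δ: quadratic near zero,
    linear for |δ| > beta, which damps the influence of outlier errors."""
    a = abs(delta)
    return 0.5 * a * a if a <= beta else beta * (a - 0.5 * beta)

print(huber(0.5))  # 0.125 (quadratic regime)
print(huber(3.0))  # 2.5   (linear regime)
```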

Model Training
Algorithm 1 shows how to train the DQN-based model, where the memory capacity M, the maximum evidence size K, the maximum number of epochs T and the reset interval C are hyper-parameters, Q(·) is defined in Eq. (7) and Q_ŷ denotes the Q-value of ŷ. First, we initialize three replay memories, the DQN-based model, and the target network in Lines 1-3. Then, in Lines 9-17, we obtain the training transition-reward pairs by letting the DQN-based model interact with the environment in an ε-greedy exploration-exploitation way (Mnih et al., 2015). Finally, in Line 19, we sample a mini-batch of transition-reward pairs to update the DQN-based model, while in Line 20, every C steps, we reset the target network to the DQN-based model.
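The loop structure above can be sketched as follows; `env_step` and `q_update` are hypothetical hooks standing in for the environment interaction and the gradient step.

```python
import random
from collections import deque

def train(env_step, q_update, n_steps=100, capacity=10_000,
          batch_size=4, reset_interval=10, eps=0.9):
    """Skeleton of DQN training: fill a replay memory via ε-greedy
    interaction, update the model on sampled mini-batches, and
    periodically reset the target network."""
    memory = deque(maxlen=capacity)      # replay memory with capacity M
    target_resets = 0
    for step in range(1, n_steps + 1):
        # explore with probability eps, otherwise act greedily
        action = "explore" if random.random() < eps else "exploit"
        memory.append(env_step(action))  # store the transition-reward pair
        if len(memory) >= batch_size:
            q_update(random.sample(list(memory), batch_size))
        if step % reset_interval == 0:   # reset target network every C steps
            target_resets += 1
    return target_resets

resets = train(lambda a: (a, 0.0), lambda batch: None)
print(resets)  # 100 steps / reset interval 10 = 10 resets
```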

Candidate Retrieval
Algorithm 2 shows how to retrieve a pair (candidate list, score list) for each label, where the candidate list stores progressively enlarged sentence sets, each being a candidate for the predicted evidence, and the score list stores the strengths with which the corresponding candidates support the label. We enlarge the two-list pair for each label in a greedy-search manner (Lines 3-10). Specifically, for each label, we first select the action with the largest Q-value (Line 5), then update the state by adding the chosen action to its predicted evidence (Line 7), and finally store the enlarged evidence Ê^(t+1) into Êŷ and the score q^(t) into qŷ (Line 8).
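The greedy loop can be sketched as follows, with a stand-in scorer in place of the trained Q-network:

```python
def retrieve_candidates(sentences, q_value, max_size=5):
    """Greedy retrieval for one label: repeatedly add the remaining
    sentence with the largest Q-value, recording each enlarged candidate
    and its score. `q_value(evidence, sentence)` is a hypothetical
    stand-in for the trained model's Q-value."""
    evidence, candidates, scores = [], [], []
    remaining = list(sentences)
    for _ in range(min(max_size, len(remaining))):
        best = max(remaining, key=lambda s: q_value(evidence, s))
        scores.append(q_value(evidence, best))
        evidence = evidence + [best]       # progressively enlarged set
        candidates.append(evidence)
        remaining.remove(best)
    return candidates, scores

# toy scorer: prefers lower-numbered sentence ids, decays as evidence grows
cands, qs = retrieve_candidates(["s3", "s1", "s2"],
                                lambda e, s: -len(e) - int(s[1]))
print([len(c) for c in cands])  # [1, 2, 3]
print(cands[0])                 # ['s1']: highest-scoring single sentence
```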
Algorithm 4: Searching for best thresholds, where min qŷ is short for min_t q_ŷ^(t) and max qŷ for max_t q_ŷ^(t), for all ŷ ∈ {T, F, N}. In outline, the algorithm stores each candidate value v into L_N and each pair (v, y) into C_N, sorts L_N in ascending order, and takes the medians of adjacent values in L_N as candidate thresholds.

Final Prediction
Algorithm 3 shows how to compute the target evidence-label pair from the (candidate list, score list) pairs obtained by Algorithm 2, where the thresholds are determined by Algorithm 4. In this algorithm, we first use the condition given by Algorithm 4 to predict N (Line 3), and then refine the prediction of T (Line 6) and F (Line 9) in turn. In Line 2, we focus on the evidences with the highest score for T and F, while ignoring the evidence for N, for the following reasons: (1) there are no supporting sentences in the evidence for N; (2) we follow a strategy commonly used in existing methods for FV, i.e., focusing only on the evidences for T and F.

Threshold Searching
Algorithm 4 shows how to search for the best thresholds (α_T, α_F, α_N) that maximize Label Accuracy (LA) over the development set. We first call Algorithm 2 to construct a set of tuples (q_T, q_F, q_N, y) from the development set, each corresponding to a development instance, where q_T, q_F and q_N are respectively the output score lists for the three labels T, F and N, and y is the corresponding true label (Line 1). We then go through two stages. The first stage (Lines 3-11) finds a threshold α_N that maximizes LA for label N, where maximizing LA amounts to maximizing the difference between the numbers of correctly and incorrectly predicted instances. The second stage (Lines 12-28) finds the thresholds α_T and α_F that maximize LA for labels T and F, respectively, where the instances satisfying the conditions for N are ignored (Line 13).
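The first stage can be sketched as follows; the scores and labels are toy values, and only the α_N search is shown.

```python
def best_threshold(values, labels, target):
    """Try the medians of adjacent sorted scores as candidate thresholds and
    keep the one maximizing the number of correctly minus incorrectly
    classified instances (score >= threshold predicts `target`)."""
    order = sorted(values)
    medians = [(a + b) / 2 for a, b in zip(order, order[1:])]
    def gain(th):
        return sum(1 if (v >= th) == (y == target) else -1
                   for v, y in zip(values, labels))
    return max(medians, key=gain) if medians else order[0]

# toy development scores for label N: high scores should mean true label N
alpha_n = best_threshold([0.9, 0.8, 0.2, 0.1], ["N", "N", "T", "F"], "N")
print(alpha_n)  # 0.5: the median separating the N from the non-N instances
```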

Dataset
Our experiments are conducted on the large-scale benchmark dataset FEVER (Thorne et al., 2018a), which consists of 185,455 annotated claims together with a set of 5,416,537 Wikipedia documents from the June 2017 Wikipedia dump. All claims are labeled as "SUPPORTS", "REFUTES", or "NOT ENOUGH INFO". Moreover, each "SUPPORTS" or "REFUTES" claim is accompanied by evidences extracted from Wikipedia documents. The dataset partition is kept the same as in Thorne et al. (2018b), as shown in Table 1.

Evaluation Metrics
The task has five evaluation metrics: 1) FEVER, the primary scoring metric, which measures the accuracy of claim verification with the requirement that the predicted evidence fully covers the ground-truth evidence for SUPPORTS and REFUTES claims; 2) Label Accuracy (LA), the accuracy of claim verification without considering the validity of the predicted evidences; 3) Precision (Pre), the macro-precision of the evidences for SUPPORTS and REFUTES claims; 4) Recall, the macro-recall of the evidences for SUPPORTS and REFUTES claims; 5) F1, the F1-score of the evidences for SUPPORTS and REFUTES claims. We choose F1 as our main metric because it directly reflects the performance of methods on the retrieval of precise evidences.
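For instance, the instance-level FEVER criterion can be sketched as follows (sentence identifiers are hypothetical):

```python
def fever_hit(pred_label, pred_evidence, gold_label, gold_evidences):
    """Instance-level FEVER criterion: the label must be correct and, for
    SUPPORTS/REFUTES claims, the predicted evidence must fully cover at
    least one ground-truth evidence set."""
    if pred_label != gold_label:
        return False
    if gold_label == "NOT ENOUGH INFO":
        return True
    return any(gold <= set(pred_evidence) for gold in gold_evidences)

print(fever_hit("SUPPORTS", ["s1", "s2", "s3"], "SUPPORTS", [{"s1", "s2"}]))  # True
print(fever_hit("SUPPORTS", ["s1"], "SUPPORTS", [{"s1", "s2"}]))              # False
```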

Implementation Details
Document retrieval. The document retrieval stage is kept the same as in previous work (Hanselowski et al., 2018; Zhou et al., 2019; Ye et al., 2020). Given a claim, the method first utilizes the constituency parser from AllenNLP (Gardner et al., 2018) to extract potential entities from the claim. Then it uses the entities as search queries to find the relevant documents via the online MediaWiki API, and the retrieved articles are kept as in Hanselowski et al. (2018).

Sentence selection and claim verification. We implement our DQN-based model with PyTorch and train it with the AdamW (Loshchilov and Hutter, 2019) optimizer, keeping the sentence encoding module frozen and inheriting the RoBERTa implementation from Wolf et al. (2020). Specifically, the learning rate is 5e-6, the batch size is 128, the number of training epochs is 30, the number of iteration steps (i.e., the largest evidence size K) is 5, the discount factor λ is 0.95, and the number of layers in the context sub-module is 3. A prioritized experience replay memory (Schaul et al., 2016) with a capacity of 10,000 is used to store transitions. The target network is reset after every 10 updates of DQN. The probability of the ε-greedy policy starts at 0.9 and decays exponentially towards 0.05 with a decay rate of 1/2000. Table 2 shows the thresholds α_T, α_F and α_N computed by Algorithm 4. All experiments were conducted on an NVIDIA RTX 2080 Ti GPU with 10 GB of memory.
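The exploration schedule above corresponds to the standard exponential ε decay, sketched here:

```python
import math

def epsilon(step, start=0.9, end=0.05, decay=2000):
    """Exponentially decayed exploration rate for the ε-greedy policy:
    starts at `start` and decays towards `end` at a rate of 1/decay per step."""
    return end + (start - end) * math.exp(-step / decay)

print(round(epsilon(0), 2))      # 0.9 at the first step
print(round(epsilon(20000), 2))  # ≈ 0.05 after many steps
```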

Baselines
We compare our method with the following baselines, including six methods that focus on claim verification and one joint method TwoWingOS (Yin and Roth, 2018). The six methods include: (1) GEAR (

Results and Analysis
As shown in Table 3, we implement four versions of the evidence encoding module and evaluate them on the DEV set and the blind TEST set. The FEVER metric of the top six methods is calculated with the imprecise evidences, so we introduce the FEVER@5 metric for a fair comparison. We analyze our method from the following four aspects.
Comparison with the state-of-the-art methods. Table 3 shows that all versions (except BiLSTM-A) with post-processing significantly outperform the state-of-the-art methods on FEVER, Pre, and F1, especially T-A on F1, which shows the superiority of our method in the retrieval of precise evidences. However, none of the four versions of our method achieves the best result on FEVER@5, LA, and Recall. The reason for the low Recall is that precise evidences contain fewer sentences than imprecise evidences, so other methods have a higher probability of recalling the ground-truth evidences than ours. The relatively low LA is in turn caused by the low Recall of precise evidences. To further clarify this point, we evaluate our method on a subset of the DEV set where the ground-truth evidences are recalled successfully. Our method improves the performance significantly on this subset, as shown in Table 4, which justifies our point of view. Since FEVER is affected by LA and Recall, the low FEVER@5 is also due to the low recall of precise evidences. In addition, the results reported in Table 5 show that our method can significantly reduce the number of unnecessary sentences in a predicted evidence.

Comparison between different versions. As shown in Table 3, T-T and T-A perform better than BiLSTM-T and BiLSTM-A, respectively, on almost all metrics, except that T-T is slightly worse than BiLSTM-A on FEVER@5, which suggests that Transformer encodes better context-aware representations than BiLSTM in our context sub-module. Moreover, we find that T-A performs better than T-T on almost all metrics except Recall, and that BiLSTM-A is worse than BiLSTM-T on Pre and F1. This contrary result shows that the performance of the aggregation sub-module is affected by the context sub-module; thus, the choice between Transformer and attention should depend on the context sub-module. Overall, T-A achieves the best performance among all four versions of our proposed method.

Table 6: The beam-search result of KGAT on the DEV set (%). The width (k) means selecting the top-k results at each search step. The results obtained with/without post-processing (namely threshold searching and final prediction) are displayed in each width's first/second row ("w."/"o."). We employed the released KGAT source code to implement beam-search for finding precise evidences, and the evaluation data for KGAT was kept the same as ours.

Table 7: Cases in FEVER. We list the predicted evidences of GEAR, KGAT and our method. (title, i) denotes the i-th sentence in the corresponding wiki document. In predicted evidences, the sentences highlighted in blue bold italics and underline are sentences in the target evidence, while the others in black are unnecessary ones.
Comparison on retrieval of precise evidences.
TwoWingOS is a supervised-learning method that can also find precise evidences. Although it achieves slightly better performance on LA than ours, its F1 and other metrics are much worse, indicating that it performs worse than all versions of our method except BiLSTM-A in the retrieval of precise evidences. We also enhance KGAT to conduct beam-search for finding precise evidences and report the results in Table 6. The F1 score of KGAT is always higher than that of TwoWingOS but still lower than that of our method, except for BiLSTM-A.
Comparison between the methods with and without post-processing. It can be seen from Table 3 and Table 6 that post-processing (namely threshold searching and final prediction from candidates) consistently improves FEVER and LA. Although post-processing leaves our method (except T-A) with slightly lower scores on FEVER@5, it still brings KGAT significantly higher scores on FEVER@5 as well as on other metrics. These results show that post-processing is very important in the retrieval of precise evidences.

Case Study
In Table 7 we provide some cases to demonstrate the effectiveness of our method (T-A) in retrieving precise evidences. In cases #1 and #2, our method finds exactly the ground-truth evidences without introducing any unnecessary sentence, while GEAR and KGAT cannot. In cases #3 and #4, our method generates fewer unnecessary sentences in the predicted evidences than GEAR and KGAT do.

Conclusion and Future Work
In this paper, we have proposed a novel DQN-based approach to finding precise evidences for fact verification. It solves the precise-evidence problem by first employing a DQN to compute candidates and then introducing a post-processing strategy to extract the target evidence and its label from the candidates. Experimental results show that the approach achieves state-of-the-art performance in the retrieval of precise evidences. Moreover, to the best of our knowledge, this is the first attempt to employ DQN in the fact verification task. Future work will incorporate external knowledge into our approach to improve retrieval recall.