Generating Literal and Implied Subquestions to Fact-check Complex Claims

Verifying political claims is a challenging task, as politicians can use various tactics to subtly misrepresent the facts for their agenda. Existing automatic fact-checking systems fall short here, and their predictions like “half-true” are not very useful in isolation, since it is unclear which parts of a claim are true and which are not. In this work, we focus on decomposing a complex claim into a comprehensive set of yes-no subquestions whose answers influence the veracity of the claim. We present CLAIMDECOMP, a dataset of decompositions for over 1000 claims. Given a claim and its verification paragraph written by fact-checkers, our trained annotators write subquestions covering both explicit propositions of the original claim and its implicit facets, such as asking about additional political context that changes our view of the claim’s veracity. We study whether state-of-the-art models can generate such subquestions, showing that these models generate reasonable questions to ask, but predicting the comprehensive set of subquestions from the original claim without evidence remains challenging. We further show that these subquestions can help identify relevant evidence to fact-check the full claim and derive the veracity through their answers, suggesting that they can be useful pieces of a fact-checking pipeline.


Introduction
Despite a flurry of recent research on automated fact-checking (Wang, 2017; Rashkin et al., 2017; Volkova et al., 2017; Ferreira and Vlachos, 2016; Popat et al., 2017; Tschiatschek et al., 2018), we remain far from building reliable fact-checking systems (Nakov et al., 2021). This challenge motivated us to build more explainable models so the explanations can at least help a user interpret the results (Atanasova et al., 2020). However, such purely extractive explanations do not necessarily help users interpret a model's reasoning process. An ideal explanation should do what a human-written fact-check does: systematically dissect different parts of the claim and evaluate their veracity.

Figure 1: An example claim decomposition: the top two subquestions follow explicitly from the claim and the bottom two represent implicit reasoning needed to verify the claim. We can use the decomposed questions to retrieve relevant evidence (Section 6) and aggregate the decisions on the sub-questions to derive the final veracity of the claim (Section 5.3).
We take a step towards explainable fact-checking with a new approach and accompanying dataset, CLAIMDECOMP, of decomposed claims from PolitiFact. Annotators are presented with a claim and the justification paragraph written by expert fact-checkers, from which they annotate a set of yes-no subquestions that give rise to the justification. These subquestions involve checking both the explicit and implicit aspects of the claim (Figure 1). Such a decomposition can play an important role in an interpretable fact verification system. First, the subquestions provide a comprehensive explanation of how the decision is made: in Figure 1, although the individual statistics mentioned by Biden are correct, they are from different time intervals and not directly comparable, which yields the final judgment of the claim as "half-true". We can estimate the veracity of a claim using the decisions on the subquestions (Section 5.3). Second, we show that decomposed subquestions allow us to retrieve more relevant paragraphs from the verification document than using the claim alone (Section 6), since some of the subquestions tackle implicit aspects of a claim. We do not build a full pipeline for fact verification in this paper, as doing so poses other significant challenges, including information that is not available online or that needs to be parsed out of statistical tables (Singh et al., 2021). Instead, we focus on showing how these decomposed questions can fit into a fact-checking pipeline through a series of proof-of-concept experiments.

Figure 2: An example of our annotation process. The annotators are instructed to write a set of subquestions, give binary answers to them, and attribute them to a source. If the answer cannot be decided from the justification paragraph, "Unknown" is also an option. Each question is based on either the claim or the justification, and the annotators also select the relevant parts (color-coded in the figure) on which the question is based.
Equipped with the CLAIMDECOMP dataset, we train a model to generate decompositions of complex political claims. We experiment with pre-trained sequence-to-sequence models (Raffel et al., 2020), generating either a sequence of questions or a single question using nucleus sampling (Holtzman et al., 2020) over multiple rounds. This model can recover 58% of the subquestions, including some implicit subquestions. To summarize, we show that decomposing complex claims into subquestions can be learned with our dataset, and that reasoning with such subquestions can improve evidence retrieval and judging the veracity of the whole claim.

Motivation and Task
Facing the complexities of real-world political claims, simply giving a final veracity to a claim often fails to be persuasive (Guo et al., 2022). To make the judgment of an automatic fact-checking system understandable, most previous work has focused on generating justifications for models' decisions. Popat et al. (2018); Shu et al. (2019); Lu and Li (2020) used attention weights of the models to highlight the most relevant parts of the evidence, but these only deal with explicit propositions of a claim. Ahmadi et al. (2019); Gad-Elrab et al. (2019) used logic-based systems to generate justifications, yet these systems are often based on existing knowledge graphs and are hard to adapt to complex real-world claims. Atanasova et al. (2020) treated justification generation as a summarization problem in which they generate a justification paragraph according to relevant evidence. Even so, it is hard to know which parts of the claim are true and which are not, and how the generated paragraph relates to the veracity.
What is missing in the literature is a better intermediate representation of the claim: with more complex claims, explaining the veracity of a whole claim at once becomes more challenging. Therefore, we focus on decomposing the claim into a minimal yet comprehensive set of yes-no subquestions, whose answers can be aggregated into an inherently explainable decision. As the decisions on the subquestions are explicit, it is easier for one to spot discrepancies between the veracity and the intermediate decisions.
Claims and Justifications Our decomposition process is inspired by fact-checking documents written by professional fact-checkers. In the data we use from PolitiFact, each claim is paired with a justification paragraph (see Figure 2) which contains the most important factors on which the verdict made by the fact-checkers is based. Understanding what questions are answered in this paragraph will be the core task our annotators undertake to create our dataset. However, we frame the claim decomposition task (in the next section) without regard to this justification document, as it is not available at test time.

Table 1: Statistics of the CLAIMDECOMP dataset. Each claim is annotated by two annotators, yielding a total of 6,555 subquestions. The Answer % and Source % column blocks report statistics at the subquestion level, with Source % denoting the percentage of subquestions based on text from the justification or the claim.
Claim Decomposition Task We define the task of complex claim decomposition. Given a claim c and the context o of the claim (speaker, date, and venue of the claim), the goal is to generate a set of N yes-no subquestions q = {q_1, q_2, ..., q_N}. The set of subquestions should have the following properties:
• Comprehensiveness: The questions should cover as many aspects of the claim as possible: the questions should be sufficient for someone to judge the veracity of the claim.
• Conciseness: The question set should be as minimal as is practical and not contain repeated questions asking about minor, correlated variants seeking the same information.
An individual subquestion should also exhibit:
• Relevance: The answer to a subquestion should help a reader determine the veracity of the claim. Knowing an answer to a subquestion should change the reader's belief about the veracity of the original claim (Section 5.3).
We do not require subquestions to stand alone (Choi et al., 2021); they are instead interpreted with respect to the claim and its context.

Evaluation Metric
We set the model to generate the target number of subquestions, which matches the number of subquestions in the reference, guaranteeing a concise subquestion set. Thus, we focus on measuring the other properties with reference-based evaluation. Specifically, given an annotated set of subquestions and an automatically predicted set of subquestions, we assess recall: how many subquestions in the reference set are covered by the generated question set? A subquestion in the reference set is considered recalled if it is semantically equivalent to one of the model-generated subquestions. Our notion of equivalence is nuanced and contextual: for example, the following two subquestions are considered semantically equivalent: "Is voting in person more secure than voting by mail?" and "Is there a greater risk of voting fraud with mail-in ballots?". We manually judge question equivalence, as our experiments with automatic evaluation metrics did not yield reliable results (details in Appendix E).
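The recall computation above can be sketched as follows. This is a minimal illustration, not the authors' code: the semantic-equivalence judgment is performed manually in the paper, so here it is stood in for by a toy lookup table (`EQUIV` and `judged_equivalent` are hypothetical names for this illustration).

```python
def subquestion_recall(reference, generated, is_equivalent):
    """Fraction of reference subquestions covered by the generated set.

    `is_equivalent` encodes the (manual, contextual) semantic-equivalence
    judgment between two questions; here it is a stand-in for the human check.
    """
    if not reference:
        return 0.0
    recalled = sum(
        1 for ref in reference
        if any(is_equivalent(ref, gen) for gen in generated)
    )
    return recalled / len(reference)

# Toy equivalence table standing in for the human judgments.
EQUIV = {
    ("Is voting in person more secure than voting by mail?",
     "Is there a greater risk of voting fraud with mail-in ballots?"),
}

def judged_equivalent(a, b):
    return a == b or (a, b) in EQUIV or (b, a) in EQUIV
```

With one of two reference questions matched by a paraphrase, the recall is 0.5.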

Dataset Collection
Claim / Verification Document Collection We collect political claims and corresponding verification articles from PolitiFact. Each article contains one justification paragraph (see Figure 2) which states the most important factors on which the verdict made by the fact-checkers is based. Understanding what questions are answered in this paragraph will be the core annotation task. Each claim is classified with one of six labels: pants on fire (most false), false, barely true, half-true, mostly true, and true. We collect the claims from the top 50 PolitiFact pages for each label, resulting in a total of 6,859 claims. A claim like "Approximately 60,000 Canadians currently live undocumented in the USA." hinges on checking a single statistic and is less likely to contain information beyond its surface form. Therefore, we mainly focus on studying complex claims in this paper. To focus on complex claims, we filter out claims with 3 or fewer verbs. We also filter out claims that do not have an associated justification paragraph. After the filtering, we get a subset consisting of 1,494 complex claims.
Decomposition Annotation Process Given a claim paired with the justification written by the professional fact-checkers on PolitiFact, we ask our annotators to reverse-engineer the fact-checking process: generate yes-no questions which are answered in the justification. As shown in Figure 2, for each question, the annotators also (1) give the answer; and (2) select the relevant text in the justification or claim that is used for the generation (if any).
The annotators are instructed to cover as many of the assertions made in the claim as possible without being overly specific in their questions. This process gives rise to both literal questions, which follow directly from the claim, and implied questions, which are not necessarily as easy to predict from the claim itself. These are not attributes labeled by the annotators, but labels the authors assign post-hoc (described in Section 5).
We recruit 8 workers with experience in literature or politics from the freelancing platform Upwork to conduct the annotation. Appendix A includes details about the hiring process, the workflow, the instructions, and the UI.

Dataset statistics and inter-annotator agreement
Table 1 shows the statistics of our dataset. We collect two sets of annotations per claim to improve subquestion coverage, yielding a total of 6,555 subquestions for 1,200 claims. Most of the questions arise from the justification, and most of the questions can be answered by the justification. In addition, we randomly sample 50 claims from the validation set for our human evaluation in the rest of this paper. We name this set Validation-sub.
Comparing sets of subquestions from different annotators is nontrivial: two annotators may choose different phrasings of individual questions and even different decompositions of the same claim that end up targeting the same pieces of information. Thus, we (the authors) manually compare two sets of annotations to judge inter-annotator agreement: given two sets of subquestions on the same claim, the task is to identify questions whose semantics are not expressed by the other question set. If no questions are selected, the two annotators show strong agreement on what should be captured in subquestions. Example annotations are shown in Appendix D.
We randomly sample 50 claims from our dataset and three of the authors conduct the annotation. The authors agree reasonably well on this comparison task, with a Fleiss' kappa (Fleiss, 1971) value of 0.52. The comparison results are shown in Table 2. On average, the semantics of 18.4% of questions are not expressed by the other set. This demonstrates the comprehensiveness of our question sets: only a small fraction is not captured by the other set, indicating that independent annotators are not easily coming up with distinct sets of questions. Because most questions are covered in the other set, we view the agreement as high. A simple heuristic to improve comprehensiveness further is to prefer the annotator who annotated more questions. If we consider the fraction of unmatched questions in the FEWER QS set, we see this drops to 8.5%. Through this manual examination, we also found that the annotated questions are overall concise, fluent, clear, and grammatical.
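For reference, Fleiss' kappa over a table of item-by-category rating counts can be computed as below. This is a textbook implementation of the standard formula, offered as a sketch rather than the authors' actual analysis script.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for chance-corrected agreement among n raters.

    `ratings[i][j]` = number of raters who assigned item i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(ratings)          # number of items
    n = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # number of categories
    # Observed per-item agreement P_i, averaged into P_bar.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Expected agreement P_e from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1; agreement no better than chance yields kappa near or below 0.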

Automatic Claim Decomposition
The goal is to generate a subquestion set q from the input claim c, the context o, and the target number of subquestions k.

Models
We fine-tune a T5-3B (Raffel et al., 2020) model to automate the question generation process under two settings, QG-MULTIPLE and QG-NUCLEUS, as shown in Figure 3. Both generation methods generate the same number of subquestions, equal to the number of subquestions written by an annotator.

QG-MULTIPLE
We train the model to map the claim and its context to the full set of subquestions as a single sequence: the annotated subquestions are concatenated by their annotation order to construct the output.

Table 3: Human evaluation results on the Validation-sub set (N=146). R-all denotes the recall for all questions; R-literal denotes the recall for the literal questions; and R-implied denotes the recall for the implied questions.
QG-NUCLEUS We learn a model P(q | c, o) that places a distribution over single subquestions given the claim and context. For training, each annotated subquestion is paired with the claim to form a distinct input-output pair. At inference, we use nucleus sampling to generate questions. See Appendix F for training details. We also train these generators in an oracle setting where the justification paragraph is appended to the claim, to understand how well the question generator does with more information. We denote the two oracle models QG-MULTIPLE-VERIFY and QG-NUCLEUS-VERIFY, respectively.
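The truncation step of nucleus sampling (Holtzman et al., 2020) can be sketched as follows. This is a minimal stand-alone illustration over an explicit probability vector, not the T5 decoding loop used in the paper.

```python
import random

def nucleus_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize (the top-p truncation of nucleus sampling)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample one token index from the renormalized nucleus."""
    dist = nucleus_filter(probs, p)
    idxs = list(dist)
    return rng.choices(idxs, weights=[dist[i] for i in idxs])[0]
```

For probabilities [0.5, 0.3, 0.15, 0.05] and p = 0.9, the nucleus is the top three tokens, so the lowest-probability tail is never sampled.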
Results All models are trained on the training portion of our dataset and evaluated on the Validation-sub set. One of the authors evaluated the recall of each annotated subquestion in the generated subquestion set. The results are shown in Table 3. We observe that most of the literal questions can be generated, while only a few of the implied questions can be recovered. Generating multiple questions as a single sequence (QG-MULTIPLE) is more effective than sampling multiple questions (QG-NUCLEUS): questions generated by QG-NUCLEUS are often slightly different on the surface but share the same semantics. More than 70% of the literal questions and 18% of the implied questions can be generated by the best QG-MULTIPLE model. By examining the generated implied questions, we find that most of them belong to the domain knowledge category in Section 5. Some questions could be better generated if related evidence were retrieved first, especially questions of the context category (Section 5). The QG-MULTIPLE-VERIFY model can recover most of the literal questions and half of the implied questions. Although this is an oracle setting, it shows that, given proper information about the claim, the T5 model can achieve much better performance. We discuss this retrieval step more in Section 9.
Qualitative Analysis While our annotated subquestion sets cover most relevant aspects of the claim, we find that some generated questions are good subquestions missing from our annotated set, though less important. For example, for our introduction example shown in Figure 1, the QG-NUCLEUS model generates the question "Is Trump responsible for the increased murder rate?" Using the question generation model in collaboration with humans might be a promising direction for more comprehensive claim decomposition. See Appendix H for more examples.

Analyzing Decomposition Annotations
In this section, we study the characteristics of the annotated questions. We aim to answer: (1) How many of the questions address implicit facets of the claim, and what are the characteristics of these? (2) How do our questions differ from previous work on question generation for fact-checking (Fan et al., 2020)? (3) Can we aggregate subquestion judgments into the final claim judgment?

Subquestion Type Analysis
We (the authors) manually categorize 285 subquestions from 100 claims in the development set into two disjoint sets, literal and implied: literal questions are derived from the surface information of the claim (the question can be posed given only the claim), while implied questions need extra knowledge in order to pose. Table 4 shows basic statistics about these sets, including the average number of subquestions per claim and the lexical overlap between subquestions and the base claims, evaluated with ROUGE precision, since a subquestion can be a subsequence of the original claim. On average, each claim contains one implied question which represents the deeper meaning of the claim. These implied questions overlap less with the claim.
We further manually categorize the implied questions into the following four categories, reflecting what kind of knowledge is needed to pose them (examples in Figure 4). Two authors conduct the analysis over 50 examples, and the annotations agree with a Cohen's kappa (Cohen, 1960) score of 0.74.
• Domain knowledge: The subquestion seeks domain-specific knowledge, for example asking about further steps of a legal or political process.
• Context: The subquestion involves knowing that broader context is relevant, such as whether something is broadly common, or the background of the claim (political affiliation of the politician, history of the events stated in the claim, etc.).
• Implicit meaning: The subquestion involves unpacking the implicit meaning of the claim, specifically anchored to what the speaker's intent was.
• Statistical rigor: The subquestion involves checking over-claimed or over-generalized statistics (e.g., the highest raw count is not the highest per capita).
Most of the implied subquestions require either domain knowledge or context about the claim, reflecting the challenges behind automatically generating such questions.

Table 5: Results from the user study on the helpfulness (rated 1-5) of a set of generated subquestions for claim verification. We conduct a t-test over the collected scores.

Comparison to QABriefs
Our work is closely related to the QABriefs dataset (Fan et al., 2020), which also asks annotators to write questions reconstructing the process taken by professional fact-checkers, given the claim and its verification document. While sharing similar motivation, we use a significantly different annotation process, resulting in qualitatively different sets of questions, as shown in Figure 5. We notice: (1) Their questions are less comprehensive, often missing important aspects of the claim. (2) Their questions are broader and less focused on the claim. We instructed annotators to provide the source of each annotated subquestion, either the claim or the verification document. For example, a question like "What are Payday lenders?" in the figure would not appear in our dataset, as the justification paragraph does not address such a question. Fan et al. (2020) dissuaded annotators from providing binary questions; instead, they gather answers to their subquestions after the questions are collected. We focus on binary questions whose answers could help verification of the full claim. See Appendix I for more examples of the comparison.

Claim:
The group With Honor stated on September 10, 2018 in a TV ad: Kentucky Rep. Andy Barr "would let shady payday lenders take advantage of our troops" and that he took "$36,550 from payday lenders."

Are there any protections for service members using payday lending services?
Has Barr's voting record directly affected protection for veterans against payday lenders?
Figure 5: Comparison between our decomposed questions and QABriefs (Fan et al., 2020). In general, our decomposed questions are more comprehensive and more relevant to the original claim.

User Study To better quantify the difference, we conduct a user study in which we ask annotators to rate how useful a set of questions (without answers) is for determining the veracity of a claim. On 42 claims annotated by both approaches, annotators score sets of subquestions on a Likert scale from 1 to 5, where 1 denotes that knowing the answers to the questions does not help at all and 5 denotes that they can accurately judge the claim once they know the answers. We recruit annotators from MTurk, collect 5-way annotations for each example, and conduct a t-test over the results. Details can be found in Appendix C. Table 5 reports the user study results. Our questions achieve a significantly higher relevance score than questions from QABriefs. This indicates that we can potentially derive the veracity of the claim from our decomposed questions, since they are binary and highly relevant to the claim.

Deriving the Veracity of Claims from Decomposed Questions

Is the veracity of a claim the sum of its parts? We estimate whether answers to subquestions can be used to determine the veracity of the claim.
We predict a veracity score v ∈ [0, 1] equal to the fraction of subquestions with yes answers. We can map this to the discrete 6-label scale by associating the labels pants on fire, false, barely true, half true, mostly true, and true with successive intervals of the score. Table 6 compares our heuristic with simple baselines (random assignment and most-frequent-class assignment). Our heuristic easily outperforms the baselines, with the predicted label on average shifted by only one label, e.g., mostly true vs. true. This demonstrates the potential of building a more complex model to aggregate subquestion-answer sets, which we leave as a future direction.
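The aggregation heuristic can be sketched as below. One assumption is flagged: the paper maps the score to the six labels via intervals of [0, 1], and this sketch assumes equal-width intervals, which may differ from the exact bucketing used.

```python
LABELS = ["pants on fire", "false", "barely true",
          "half true", "mostly true", "true"]

def veracity_score(answers):
    """Fraction of subquestions answered 'yes' (answers are booleans)."""
    return sum(answers) / len(answers)

def score_to_label(v, labels=LABELS):
    """Map v in [0, 1] to a discrete label via equal-width intervals
    (an assumption about the bucketing, not the paper's exact scheme)."""
    idx = min(int(v * len(labels)), len(labels) - 1)
    return labels[idx]
```

For example, two yes and two no answers give v = 0.5, which falls in the "half true" bucket under this scheme.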
Our simple aggregation suffers in the following cases: (1) The subquestions are not equal in importance. The first example in Figure 4 contains two yes subquestions and two no subquestions, so our aggregation yields a half-true label, differing from the gold label barely-true. (2) Not all questions are relevant. As indicated by question aggregation*, we are able to achieve better performance after removing unrelated questions. (3) In a few cases, the answer to a question could inversely correlate with the veracity of a claim. For example, the claim states "Person X implied Y" and the question asks "Did person X not imply Y?" We think all of these cases can potentially be fixed by stronger models. For example, a question salience model could mitigate (1) and (2), and promote research on understanding the core arguments of a complex claim. We leave this as future work.

Evidence Retrieval with Decomposition
Lastly, we explore using claim decomposition for retrieving evidence paragraphs to verify claims. Retrieval from the web to check claims is an extremely hard problem (Singh et al., 2021). We instead explore a simplified proof-of-concept setting: retrieving relevant paragraphs from the full justification document. These articles are lengthy, containing an average of 12 paragraphs, and contain distractors due to entity and concept overlap with the claims. We aim to show two advantages of using the decomposed questions: (1) The implied questions contain information helpful for retrieving evidence beyond the lexical information of the claim. (2) We can convert the subquestions to statements and treat them as hypotheses, applying off-the-shelf NLI models to retrieve evidence that entails these hypotheses (Chen et al., 2021).

Evidence Paragraph Collection
We first collect human annotations to identify relevant evidence paragraphs. Given the full PolitiFact verification article consisting of m paragraphs p = (p_1, ..., p_m) and a subquestion, annotators find paragraphs relevant to the subquestion. As this requires careful document-level reading, we hire three undergraduate linguistics students as annotators. We use the 50 claims from the Validation-sub set and present the annotators with the subquestions and the articles. For each subquestion and each paragraph in the article, we ask the annotators to choose whether the paragraph serves as context for the subquestion or whether it supports/refutes the subquestion. The statistics and inter-annotator agreement are shown in Table 7. Out of 12.4 paragraphs on average, 3-4 paragraphs were directly relevant to the claim and the rest mostly provide context.
Experimental Setup We experiment with three off-the-shelf RoBERTa-based (Liu et al., 2019) NLI models trained on three different datasets: MNLI (Williams et al., 2018), NQ-NLI (Chen et al., 2021), and DocNLI (Yin et al., 2021). We compare the performance of the NLI models with random, BM25, and human baselines. We first convert the subquestions q = {q_1, ..., q_N} of claim c to a set of statements h = {h_1, ..., h_N} using GPT-3 (Brown et al., 2020). We find that with only 10 examples as demonstrations, GPT-3 can perform the conversion quite well (with an error rate of less than 5%). For more information about the prompt, see Appendix G.

Figure 6: Illustration of the evidence paragraph retrieval process. The notation corresponds to our description in Section 6. K is a hyperparameter controlling the number of passages to retrieve.

Table 8: Evidence retrieval results with the decomposed claims (from predicted and annotated (gold) subquestions) and the original claim on the Validation-sub set. A random baseline achieves 24.9 F1 and human annotators achieve 69.0 F1.
To retrieve evidence that supports the statements, we treat the statements as hypotheses and the paragraphs in the article as premises. We feed each pair into an NLI model to compute the score associated with the "entailment" class: the score for paragraph p_i and hypothesis h_j is the output probability s_ij = P(Entailment | p_i, h_j). We then score each paragraph by its best hypothesis, p'_i = max({s_ij | 1 ≤ j ≤ N}), and select as evidence the top k paragraphs by this score. We set k to be the number of paragraphs annotated with either support or refute. Figure 6 describes this approach.
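The selection step can be sketched as follows, with a toy score matrix standing in for the NLI model's entailment probabilities; this is an illustration of the max-over-hypotheses ranking, not the paper's implementation.

```python
def top_k_paragraphs(scores, k):
    """Select the k paragraphs with the highest best-hypothesis score.

    `scores[i][j]` stands in for P(Entailment | paragraph i, hypothesis j)
    as output by an NLI model; each paragraph is ranked by
    p'_i = max_j scores[i][j].
    """
    best = [max(row) for row in scores]
    ranked = sorted(range(len(best)), key=lambda i: best[i], reverse=True)
    return ranked[:k]
```

For instance, with scores [[0.1, 0.2], [0.9, 0.05], [0.4, 0.6]] and k = 2, paragraphs 1 and 2 are selected: each is strongly entailed by at least one hypothesis, even though no single hypothesis covers both.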
To retrieve evidence that refutes the statements, we follow the same process, but with the negated hypothesis set generated by GPT-3. (Note that our NLI models trained on NQ-NLI and DocNLI have only two classes, entailed and not entailed, and not entailed is not a sufficient basis for retrieval.) The final evidence set is obtained by merging the evidence from the support and refute sets: we remove duplicates and then take the top k paragraphs according to the scores.
The BM25 baseline uses the retrieval score instead of the NLI score. The random baseline randomly assigns support, refute, and neutral labels to paragraphs based on the paragraph label distribution in Table 7. Human performance is computed by selecting one of the three annotators and comparing their annotations with the other two (we randomly pick one annotator if they do not agree), averaging over all three annotators. This is not directly comparable to the numbers for the other techniques, as the gold labels are slightly different.

Results
The results are shown in Table 8. We see that the decomposed questions are effective for retrieving evidence. By aggregating evidence from the subquestions, both BM25 and the NLI models do better than using the claim alone, except when using DocNLI, and BM25 with the predicted decomposition. The best model with gold annotations (59.6) is close to human performance (69.0) in this limited setting, indicating that the detailed and implied information in decomposed questions can help gather evidence beyond the surface level of the claim.
DocNLI outperforms BM25 on both the annotated decomposition and the predicted decomposition. This demonstrates the potential of using NLI models to aid evidence retrieval in the wild, although they must be combined with decomposition to yield good results.
Related Work

Our implied subquestions go beyond what is mentioned in the claim, asking about the intention and political agenda of the speaker. Gabriel et al. (2022) study such implications by gathering expected readers' reactions and writers' intentions towards news headlines, including fake news headlines.
To produce verdicts on the claims, other work generates explanations for models' predictions. Popat et al. (2017, 2018); Shu et al. (2019); Yang et al. (2019); Lu and Li (2020) presented attention-based explanations; Gad-Elrab et al. (2019); Ahmadi et al. (2019) used logic-based systems; and Atanasova et al. (2020); Kotonya and Toni (2020) modeled explanation generation as a summarization task. Combining answers to the decomposed questions in our work can form an explicit explanation of the verdict.

Conclusion
We present a dataset containing more than 1,000 real-world complex political claims paired with their decompositions in question form. With the decompositions, we are able to check the explicit and implicit arguments made in the claims. We also show that the decompositions can play an important role in both evidence retrieval and veracity composition for an explainable fact-checking system.

Limitations
Interaction of retrieval and decomposition The evidence retrieval performance depends on the quality of the decomposed questions (compare our results on generated questions to those on annotated questions in Section 6). Yet, generating high-quality questions requires relevant evidence context. These two modules cannot be strictly pipelined, and we envision that in future work they will need to interact in an iterative fashion. For example, we could address this with a human-in-the-loop approach. First, retrieve some context passages with the claim to verify as a query, possibly focused on the background of the claim and the person who made it. This retrieval can be done by a system or a fact-checker. Then, we use the context passages to retrain the QG model with the annotations we have, and the fact-checker can make a judgment about the generated questions, adding new questions if the generated ones do not cover the whole claim. We envision that such a process can make fact-checking easier while providing data to train the retrieval and QG models.

Difficulty of automatic question comparison
As discussed in Section 4, automatic metrics for evaluating our set of generated questions do not align well with human judgments. Current automatic metrics are not sensitive enough to minor changes that can alter the semantics of a question. For example, changing "Are all students in Georgia required to attend chronically failing schools?" to "Are students in Georgia required to attend chronically failing schools?" yields two questions that draw an important contrast. However, the two questions receive an extremely high BERTScore (0.99) and ROUGE-L score (0.95). Evaluating question similarity without considering how the questions will be used is challenging, since we do not know which minor distinctions between questions may be important. We suggest measuring the quality of the generated questions on downstream tasks, e.g., evidence retrieval.
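This insensitivity can be reproduced with a minimal ROUGE-L implementation (LCS-based F1 over whitespace tokens; real ROUGE implementations add preprocessing, so treat this as an approximation):

```python
def lcs_len(a, b):
    # longest common subsequence length via dynamic programming
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    c, r = candidate.lower().split(), reference.lower().split()
    l = lcs_len(c, r)
    p, rec = l / len(c), l / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

q_ref = "Are all students in Georgia required to attend chronically failing schools?"
q_cand = "Are students in Georgia required to attend chronically failing schools?"
score = rouge_l_f1(q_cand, q_ref)  # ~0.95 despite the dropped "all"
```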
General difficulty of the task We have not yet built a full pipeline for fact-checking in a true real-world scenario. Instead, we envision our proposed question decomposition as an important step of such a pipeline, in which we use the candidate decompositions to retrieve deeper information, verify or refute each subquestion, and then compose the results of the subquestions into an inherently explainable decision. In this paper, we have shown that the decomposed questions can help the retriever in a clean setting. But retrieving evidence in the wild is extremely hard, since some statistics are not accessible through IR and not all available information is trustworthy (Singh et al., 2021); these issues are beyond the scope of this paper. Through the question aggregation probing, we also show the potential of composing the veracity of claims from the decisions on the decomposed questions. The proposed dataset opens a door to studying the core argument of a complex claim.

Domain limitations and lack of representation
The dataset we collected only consists of English political claims from the PolitiFact website. These claims are US-centric and largely focused on politics; hence, our results should be interpreted with respect to these domain limitations.
Broader impact: opportunities and risks of deployment Automated fact-checking can help prevent the propagation of misinformation and has great potential to bring value to society. However, we should proceed with caution, as the output of a fact-checking system (the veracity of a claim) could alter users' views toward someone or something. Given this responsibility, we view it as crucial to develop explainable fact-checking systems which inform users which parts of the claim are supported, which parts are not, and what evidence supports these judgments. In this way, even if the system makes a mistake, users will be able to check the evidence and draw conclusions themselves. Although we do not present a full fact-checking system here, we believe our dataset and study can help pave the way towards building more explainable systems. By introducing this claim decomposition task and dataset, we will enable the community to further study the different aspects of real-world complex claims, especially the implied arguments behind the surface information.

A.1 Workflow
Tracing the thought process of professional fact-checkers requires careful reading. Thus, instead of using crowdsourcing platforms with limited quality control, we recruit 8 workers with experience in literature or politics from the freelancing platform Upwork. We pay the annotators $1.75 per claim, which translates to around $30/hour. Each annotator labeled an initial batch of articles and we provided feedback on their annotations. We communicated with annotators throughout the process.
We posted a job advertisement including the description and the payment plan of our task on the Upwork platform. In total, 14 workers applied for the position. We first conducted an initial qualification round in which we released a batch of 15 documents for the annotators to complete, for which we paid $35. This initial batch was used to judge how suitable the annotators were for the task. We reviewed the annotations and gave detailed feedback to each annotator for every claim, along with our suggested annotation for reference. We selected annotators whose annotations met our qualifications to continue to the next round; in this initial round, we selected 8 of the 14 annotators who applied.
After the initial round, we released new example batches to the annotators on a weekly basis. Each batch contained 100 examples, for which we paid $175. The hired annotators were required to complete at least one batch per week and could do up to 2 batches per week.

A.2 Annotation Interface
The interface of the main question decomposition task is shown in Figure 7.

B Evidence Annotation Interface
The interface to annotate the supporting/refuting evidence described in Section 6 is shown in Figure 8.

C User Study Interface
The annotation interface for our user study described in Section 5.2 is shown in Figure 9.

D Inter-annotator Agreement
Two examples of our inter-annotator agreement assessment are shown in Figure 10. In the first example, we treat Q3 of annotator A as not covered by annotator B: it is a weaker version of Q2 and is not mentioned by annotator B. Q4 of annotator A has similar semantics to Q3 of annotator B, so we do not mark it.

E Automatic Claim Decomposition Evaluation
For the evaluation in Section 4, we also explored an automated method for assessing whether generated questions match ground-truth ones. We aim to define a metric m(q, q̂) that compares the gold question set q with the generated set q̂. However, we lack good off-the-shelf methods for comparing sets of strings like this. Instead, we rely on existing scoring functions that compare single strings, like ROUGE and BERTScore (Zhang et al., 2019). Following other alignment-based methods like SummaC (Laban et al., 2022) for summarization factuality, we define the metric via an alignment variable a: m(q, q̂) = max_a (1/n) Σ_i s(q_{a_i}, q̂_i), i.e., the mean pairwise score under the best alignment. This is the maximum-weight matching problem in a bipartite graph, which we solve with the Hungarian algorithm (Kuhn, 1955). The results are shown in Table 9.
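The alignment metric can be sketched as follows. We substitute a simple token-overlap F1 for the pairwise scorer s (in place of ROUGE or BERTScore) and brute-force the matching over permutations, which is only feasible for tiny sets; the paper uses the Hungarian algorithm, which scales polynomially.

```python
from itertools import permutations

def token_f1(a, b):
    # token-overlap F1, a cheap stand-in for ROUGE or BERTScore
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def set_score(generated, gold):
    # mean pairwise score under the best one-to-one alignment
    n = min(len(generated), len(gold))
    best = 0.0
    for perm in permutations(range(len(generated)), n):
        s = sum(token_f1(generated[g], gold[i]) for i, g in enumerate(perm)) / n
        best = max(best, s)
    return best
```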
The automatic metrics are not well aligned with human judgments. The Pearson coefficient between human judgments and the automatic metrics ranges from 0.42-0.54 for QG-MULTIPLE and 0.21-0.45 for QG-NUCLEUS. This instability of the Pearson coefficient indicates that the automatic evaluation may not accurately reflect the quality of the generated questions. Therefore, evaluating the generated questions on downstream tasks could be more accurate, which is why we also study evidence retrieval.
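For reference, the Pearson coefficient used here is the standard sample correlation between paired metric scores and human judgments; a minimal implementation:

```python
from math import sqrt

def pearson(xs, ys):
    # sample Pearson correlation coefficient between two paired lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```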

F Training Details for Question Generation
For QG-MULTIPLE, each input-output instance is constructed according to the template, where [S] denotes the separator token of the T5 model. N denotes the number of questions to generate; we introduce it into the input to serve as a control variable and set it to match the number of annotated questions during training. c denotes the claim and q_i denotes the i-th annotated question; we do not assume a specific order for the questions.
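Since the exact template string is not reproduced here, the following is a hypothetical rendering of the described construction: N prepended to the claim as a control variable on the input side, and the annotated questions joined with the separator token on the output side.

```python
SEP = "[S]"  # placeholder for the actual T5 separator token

def build_example(claim, questions):
    # hypothetical rendering of the paper's template (exact form not shown
    # in the text): "N [S] claim" -> "q_1 [S] q_2 [S] ... [S] q_N"
    source = f"{len(questions)} {SEP} {claim}"
    target = f" {SEP} ".join(questions)
    return source, target
```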
The model is trained using the seq2seq framework of Hugging Face (Wolf et al., 2020). The max sequence lengths for input and output are set to 128 and 256, respectively. The batch size is set to 8, and we use DeepSpeed for memory optimization (Rasley et al., 2020). We train the model on our training set for 20 epochs with the AdamW optimizer (Loshchilov and Hutter, 2018) and an initial learning rate of 3e-5.
At inference, we use beam search with a beam size of 5. We prepend the number of questions to generate (N) to the claim in the input.
For QG-NUCLEUS, we construct multiple input-output instances (c → q_i) for each claim, where q_i denotes the i-th decomposed question of claim c. The max sequence lengths for input and output are both set to 128. The batch size is set to 16, and we use DeepSpeed for memory optimization. We train the model on our training set for 10 epochs with the AdamW optimizer and an initial learning rate of 3e-5.
We expect this model to place a flatter distribution over the output space, assigning high weight to many possible questions, because the training data includes multiple outputs for the same input. At inference, we use nucleus sampling (Holtzman et al., 2019) with p set to 0.95 together with top-k sampling (Fan et al., 2018) with k set to 50 to generate questions. We filter out duplicates (exact string match) from the set of sampled questions.
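The combined nucleus (top-p) and top-k filtering step can be sketched over an explicit next-token distribution. In practice this happens inside the model's sampling loop (e.g., via Hugging Face's generation utilities); this standalone version is only to illustrate the filtering rule.

```python
def nucleus_top_k_filter(probs, p=0.95, k=50):
    """Keep at most k tokens, then the smallest high-probability prefix
    whose cumulative mass reaches p; renormalize the survivors."""
    items = sorted(probs.items(), key=lambda kv: -kv[1])[:k]  # top-k cap
    kept, cum = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:  # nucleus cutoff
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}
```

Sampling then draws from the renormalized distribution, and duplicate questions are removed afterwards by exact string match.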

G GPT-3 for Question Conversion
Given a question, we let GPT-3 generate its declarative form as well as the negated form of the statement. We achieve this by separating them using "|" in the prompt. One advantage of using GPT-3 is that it can easily generate natural sentences. For example, for the question "Are any votes illegally counted in the election?", GPT-3 generates the statement "Some votes were illegally counted in the election." and its negation.
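A sketch of the prompt construction and completion parsing; only the "|" separator comes from the paper, while the prompt wording and few-shot layout here are our assumptions.

```python
def build_prompt(examples, question):
    # hypothetical few-shot prompt: each demonstration maps a question to
    # "statement | negated statement" (only the "|" separator is from the paper)
    lines = [f"Question: {q}\nStatements: {s} | {neg}" for q, s, neg in examples]
    lines.append(f"Question: {question}\nStatements:")
    return "\n\n".join(lines)

def parse_completion(completion):
    # split the model's completion into the statement and its negation
    statement, negation = (part.strip() for part in completion.split("|"))
    return statement, negation
```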

H Qualitative Analysis of Generated Questions
Table 12 includes more examples where the generated questions do not match the annotations but are also worth checking. For example, for the second claim, our model generates the question "Did any other states have a spike in coronavirus cases related to voting?" Although the gold fact-check did not address this question, this kind of context is something a fact-checker may want to be attentive to, even if the answer ends up being no, and we judge it to be a reasonable question to ask given only the claim.

I More examples of QABriefs
We include more examples reflecting the annotation differences between our method and QABriefs in Figure 11.
J Datasheet for CLAIMDECOMP

J.1 Motivation for Datasheet Creation

Why was the dataset created? Despite the progress made in automating the fact-checking process, the performance achieved by current models is relatively poor. Systems in this area fundamentally need to be designed with an eye towards human verification, motivating our effort to build more explainable models so that the explanations can be used to interpret a model's behavior. Therefore, we created this dataset to facilitate future research toward this goal. We envision that by verifying each question, we can compose the final veracity of the claim in an inherently explainable way.
Has the dataset been used already? The dataset has not been used beyond the present paper, where it was used to train a question generation model and in several evaluation conditions.
Who funded the dataset? This dataset was funded by Good Systems, a UT Austin Grand Challenge to develop responsible AI technologies.

J.2 Dataset Composition
What are the instances? Each instance is a real-world political claim. All claims are written in English and most of them are US-centric.
How many instances are there? Our dataset consists of two-way annotations of 1,200 claims and 6,555 decomposed questions. A detailed breakdown of the number of instances can be seen in Table 1 of the main paper.
What data does each instance consist of? Each instance contains a real-world political claim and a set of yes-no questions with associated answers.
Does the data rely on external resources? Yes, assembling it requires access to PolitiFact.
Are there recommended data splits or evaluation measures? We include recommended train, development, and test sets for our dataset.
The distribution can be found in Table 1.

J.3 Data Collection Process
How was the data collected? We recruited 8 annotators with backgrounds in literature or politics from the freelancing platform Upwork. Given a claim paired with the justification written by the professional fact-checker on PolitiFact, we asked our annotators to reverse-engineer the fact-checking process: generate yes-no questions that are answered in the justification. For each question, the annotators also gave the answer and selected the relevant text in the justification used to generate it. The annotators were instructed to cover as many of the assertions made in the claim as possible without being overly specific in their questions.
Who was involved in the collection process and what were their roles? The 8 annotators we recruited performed all the annotation steps outlined above.
Over what time frame was the data collected?
The dataset was collected over a period from January to April 2022.

Does the dataset contain all possible instances?
Our dataset does not cover all possible political claims. It mainly includes complex political claims made by notable U.S. political figures from 2012 to 2021.
If the dataset is a sample, then what is the population? It represents a subset of all possible complex political claims that require verifying multiple aspects of the claim to reach a final veracity. Our dataset also only includes claims written in English.

J.4 Data Preprocessing
What preprocessing / cleaning was done? We remove any additional whitespace in the annotated questions, but otherwise we do not postprocess the annotations in any way.
Was the raw data saved in addition to the cleaned data? Yes.

Does this dataset collection/preprocessing procedure achieve the initial motivation? Our collection process indeed achieves our initial goal of creating a high-quality dataset of complex political claims with decompositions in question form. Using this data, we are able to check the explicit and implicit arguments made by politicians.
When was it released? Our data and code are currently available.
What license (if any) is it distributed under? CLAIMDECOMP is distributed under the CC BY-SA 4.0 license.

Who is supporting and maintaining the dataset? This dataset will be maintained by the authors of this paper. Updates will be posted on the dataset website.

J.6 Legal and Ethical Considerations
Were workers told what the dataset would be used for and did they consent? Crowd workers were informed of the goals we sought to achieve through data collection. They also consented to have their responses used in this way through the Amazon Mechanical Turk Participation Agreement (note that even though we recruited through Upwork, workers performed annotation in the Mechanical Turk sandbox).
If it relates to people, could this dataset expose people to harm or legal action? Our dataset does not contain any personal information of crowd workers. However, our dataset can include incorrect information in the form of false claims. These claims were made in a public setting by notable political figures; in our assessment, such claims are already notable and we are not playing a significant role in spreading false claims as part of our dataset. Moreover, these claims are publicly available on PolitiFact along with expert assessments of their correctness. We believe that there is a low risk of someone being misled by information they see presented in our dataset.
If it relates to people, does it unfairly advantage or disadvantage a particular social group? We acknowledge that, because our dataset only covers English and annotators were required to be located in the US, our dataset lacks representation of claims that are relevant in other languages and to people around the world. The claims themselves could reflect misinformation rooted in racism, sexism, and other forms of intergroup bias.


Table 2:
Inter-annotator agreement assessed by the percentage of questions whose semantics cannot be matched to the other annotator's set. We name the question set containing more questions MORE QS and the other one LESS QS. ALL QS is the average of MORE QS and LESS QS.

Table 4:
Number of questions of each type per claim and their lexical overlap with the claim, measured by ROUGE-1, ROUGE-2, and ROUGE-L precision (how many n-grams in the question are also in the claim).

Figure 4: Four types of reasoning needed to address subquestions, with their proportions and examples. It shows that a high proportion of the questions need either domain knowledge or related context.

Table 6:
The claim classification performance of our question aggregation baseline vs. several baselines on the development set. MAE denotes mean absolute error.

Table 7:
Evidence paragraph retrieval data statistics on the Validation-sub dataset (50 claims).

Table 9:
Automatic evaluation results on the development set. Here, (P) denotes the Pearson correlation coefficient between the automatic metric and recall-all. -JUSTIFY denotes training the question generator by concatenating the claim and the justification paragraph as the input.