Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization

Neural abstractive summarization models are prone to generate summaries that are factually inconsistent with their source documents. Previous work has introduced the task of recognizing such factual inconsistency as a downstream application of natural language inference (NLI). However, state-of-the-art NLI models perform poorly in this context due to their inability to generalize to the target task. In this work, we show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples. We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries, introducing varying types of factual inconsistencies. Unlike previously introduced document-level NLI datasets, our generated dataset contains examples that are diverse and inconsistent yet plausible. We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization.


Intrinsic error output: "Australia's state of Victoria is receiving their first vaccination jab"

Let P = (PRED_1, ..., PRED_n) and A = (ARG_1, ..., ARG_m) be the unordered lists of extracted predicates and arguments from a source document D and the summary sentence S^+. Additionally, we assume a masked summary sentence M (described later), derived from S^+, and a control code variable c ∈ {intrinsic, extrinsic}. Generator G is trained to compute p(S^+ | P, A, M, c). As illustrated in the bottom half of Figure 2, we encode all the conditional variables into the following format:

Predicates: P; Arguments: A; Code: c; Summary: M

In the following, we describe the key steps in the input formatting process:

Step 1: [...] We summarize the different input formatting in Table 1.

Step 2: Span Reduction. To encourage G to generate fine-grained errors (Pagnoni et al., 2021; Goyal and Durrett, 2021), we also train it to hallucinate incorrect modifiers into spans from P and A. To this end, we randomly drop adjectives and adverbs from 10% of the gold predicate and argument spans. For instance, an argument span "recently elected prime minister" will be reduced to "minister". This teaches the model to generate the remaining part of the span given only the context provided in the formatted input.
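The span reduction step can be illustrated with a minimal sketch. It assumes spans arrive already POS-tagged as (token, tag) pairs with Universal POS tags, and it reads "10% of the spans" as an independent 0.1 probability per span. The function names, the tag set, and the tagging of the example span are illustrative assumptions, not the authors' implementation.

```python
import random

# Assumption: Universal POS tags; adjectives and adverbs are the droppable modifiers.
MODIFIER_TAGS = {"ADJ", "ADV"}


def reduce_span(tagged_span):
    """Drop adjective and adverb tokens from one POS-tagged span."""
    kept = [tok for tok, tag in tagged_span if tag not in MODIFIER_TAGS]
    # If every token was a modifier, keep the span unchanged rather than emptying it.
    return kept if kept else [tok for tok, _ in tagged_span]


def reduce_spans(tagged_spans, rate=0.1, rng=None):
    """Reduce each span with probability `rate`; return spans as strings."""
    rng = rng or random.Random()
    reduced = []
    for span in tagged_spans:
        if rng.random() < rate:
            tokens = reduce_span(span)
        else:
            tokens = [tok for tok, _ in span]
        reduced.append(" ".join(tokens))
    return reduced


# Hypothetical tagging of the example span from the text; a real tagger
# may label "elected" differently.
example = [("recently", "ADV"), ("elected", "ADJ"),
           ("prime", "ADJ"), ("minister", "NOUN")]
print(reduce_spans([example], rate=1.0))
```

With `rate=1.0` every span is reduced, which makes the effect easy to inspect; the 10% setting described above corresponds to `rate=0.1`.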

Step 3: Control Code. To control the type of consistency errors generated by G, we append the string "code:" followed by either "intrinsic" or "extrinsic" to the input tokens. The code is chosen randomly, with each value having probability 0.5. Once the code is chosen, we perform the remaining formatting steps accordingly (see Table 1).
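As a rough illustration of how the control code and the other conditional variables might be serialized, the sketch below samples a code with probability 0.5 and builds a string in the "Predicates: P; Arguments: A; Code: c; Summary: M" format. The helper names, the comma separator between spans, and the exact spacing are assumptions made for this example only.

```python
import random


def choose_code(rng):
    """Pick "intrinsic" or "extrinsic" with equal probability."""
    return "intrinsic" if rng.random() < 0.5 else "extrinsic"


def format_input(predicates, arguments, masked_summary, code):
    """Serialize the conditional variables into one generator input string."""
    # Assumption: spans within a list are joined with ", ".
    return (
        f"Predicates: {', '.join(predicates)}; "
        f"Arguments: {', '.join(arguments)}; "
        f"Code: {code}; "
        f"Summary: {masked_summary}"
    )


rng = random.Random(0)  # seeded only to make the sketch reproducible
code = choose_code(rng)
print(format_input(
    ["were steal from", "was induct into in"],
    ["five guitars", "The band"],
    "<span_1> <span_0> the 1960s.",
    code,
))
```

Keeping the code as a plain text field in the input lets the same generator be steered toward either error type at inference time simply by changing one token.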
Two Pennsylvania judges plead guilty to federal fraud charges.
Model learns to combine the listed spans to produce the most plausible summary.
Many of the children face federal fraud charges.
Model consolidates incorrect information.
The Alliance is pressing for action at the climate change conference.
Model learns to hallucinate new unsupported information.
The Alliance is planning to impose limits on emissions.
Model hallucinates new unsupported information.

4 We aggregate the label as "consistent" if all annotators rated the summary as a 5 and "inconsistent" otherwise.

5 [...] phenomena that occurred naturally for this task.

Table 6: Examples of NLI pairs generated by Falsesum. We show both the entailment and non-entailment hypotheses obtained from each source document. Green-highlighted spans indicate the information relevant to the summary, whereas red-highlighted spans indicate incorrect information used by the model to generate an inconsistent summary.

Predicates : is being offer for, were steal from, sell, Both as a solo artist and leader of the Heartbreakers, is one of , according to, where were rehearse for, contribute to, was induct into in; Arguments : the Heartbreakers, The band, CNN's Denise Quan, five guitars, the Recording Industry Association of America, more than 57 million albums, Petty, A 7,500 reward, a soundstage, the Rock & Roll Hall of Fame; Code : intrinsic; Summary :<span_1> <span_0> the 1960s.
Gold: Three of them were vintage guitars from the 1960s.
Generated (intrinsic): The band was inducted into the Rock & Roll Hall of Fame in the 1960s.
Predicates : is only the second time in, How could have do with, was lace with, struggle against at, have score, expect to match, had settle into, ignite, has lost, Just as was walk into, were already circulate on, begin to filter, watch on in; Arguments : his chair, Anfield, clips, the stands, symbolism, 13 Premier League goals, Brendan Rodgers, through, Liverpool, the 100-plus strikes of last season, 13 games against Hull, everything, one; Code : intrinsic; Summary :Luis Suarez took three minutes to <span_0> <span_1>.
Gold: Luis Suarez took three minutes to get his first assist for Barcelona.
Generated (intrinsic): Luis Suarez took three minutes to ignite symbolism.
Predicates : allegedly know, supposedly write, in ' was underway, is investigate, file against in by, file in, forbid, was toss by in, wait for, fire at, accuse of, decide to fire based on, new information state, told, allegedly sent to, was complicate by, Even though was toss, allegedly made, hold no more, expose to; Arguments : the case, new information states, his sexual abuse, more recent damages, people, the blog posts, 2011, him, This week, her, allowing at one of his Los Angeles stores to post naked photos of Morales on a blog that was meant to appear as though it belonged to Morales, American Apparel, The Post, a settlement, The clothing company, Charney, new information saying he allowed an employee to impersonate and post naked photos online of an alleged victim of his sexual abuse who filed a case against him in 2011, a settlement 'in the low six-digits' was underway, the company title, employee, 2012, The $260 million lawsuit, a report from March 25, 2011 that said Morales allegedly sent nude photos of herself to Charney after she stopped working at the store, nude photos of herself, Morales; Code : extrinsic; Summary :Women in the video <span_0> <span_1>.
Gold: Women in the video have been identified as current or former American Apparel workers.
Generated (extrinsic): Women in the video were allegedly sexually assaulted by Morales.