Attacking Open-domain Question Answering by Injecting Misinformation

With a rise in false, inaccurate, and misleading information in propaganda, news, and social media, real-world Question Answering (QA) systems face the challenge of synthesizing and reasoning over misinformation-polluted contexts to derive correct answers. This urgency gives rise to the need to make QA systems robust to misinformation, a topic previously unexplored. We study the risk of misinformation to QA models by investigating the sensitivity of open-domain QA models to corpus pollution with misinformation documents. We curate both human-written and model-generated false documents that we inject into the evidence corpus of QA models and assess the impact on the performance of these systems. Experiments show that QA models are vulnerable to even small amounts of evidence contamination brought by misinformation, with large absolute performance drops on all models. Misinformation attacks become more threatening when fake documents are produced at scale by neural models or when the attacker targets specific questions of interest. To defend against such a threat, we discuss the necessity of building a misinformation-aware QA system that integrates question answering and misinformation detection in a joint fashion.


Introduction
A typical Question Answering (QA) system (Chen et al., 2017; Yang et al., 2019; Karpukhin et al., 2020; Yamada et al., 2021; Glass et al., 2022) starts by retrieving a set of relevant context documents from the Web, which is then examined by a machine reader to identify the correct answer. Existing works typically treat Wikipedia as the web corpus; therefore, all retrieved context documents are assumed to be clean and trustworthy. However, real-world QA faces a much noisier environment, where the web corpus is tainted with misinformation. This includes unintentional factual mistakes made by human writers and deliberate disinformation intended to deceive. Aside from human-created misinformation, we also face the inevitability of AI-generated misinformation. With the continuing progress in text generation (Radford et al., 2019; Brown et al., 2020; Lewis et al., 2020; Ouyang et al., 2022; OpenAI, 2023), realistic-looking fake web documents can be generated at scale by malicious actors (Zellers et al., 2019; Huang et al., 2023; Pan et al., 2023).
The presence of misinformation, whether deliberately created or not, and whether human-written or machine-generated, affects the reliability of the QA system by introducing contradicting information. As shown in Figure 1 (right side), when both real and fake information are retrieved as context documents, QA models can be easily confused by the contradicting answers given by the two, since they have no ability to identify fake information or reason over contradicting contexts. Although current QA models often achieve promising performance in the idealized case of clean contexts, we argue that they may easily fail in the more realistic case of misinformation-mixed contexts.
We study the risks of misinformation to question answering by investigating how QA models behave on a misinformation-polluted web corpus that mixes real and fake information. To create such a corpus, we propose a misinformation attack strategy that curates fake versions of Wikipedia articles and injects them into the clean Wikipedia corpus. For a Wikipedia article P, we create its fake version P′ by modifying information in P, such that: 1) certain information in P′ contradicts the information in P, and 2) P′ is fluent, consistent, and looks realistic. We study both human-written and model-generated misinformation. For the human-written part, we ask Mechanical Turkers to create fake articles by modifying original wiki articles. For the model-generated part, we propose a strong rewriting model, namely BART-FG, which can controllably mask and regenerate text spans in the original article to produce fake articles. We then evaluate QA performance on the misinformation-polluted corpus. A robust QA model should be able to deal with misinformation and properly handle contradictory information.

[Figure 1: An illustration of the misinformation attack. For the question "Which NFL team represented the AFC at Super Bowl 50?", the corpus contains the original Wikipedia article ("The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 ... at Levi's Stadium ... at Santa Clara, California."), a human-created fake article (claiming the Philadelphia Eagles defeated the Green Bay Packers in 2007), and a model-generated fake article (claiming the San Francisco 49ers defeated the Carolina Panthers 24-08 in 2012). When the retriever surfaces both real and fake passages, the reader's prediction changes from "Denver Broncos" to "Philadelphia Eagles".]
Unfortunately, extensive experiments show that existing QA models are vulnerable to misinformation attacks, regardless of whether the fake articles are manually written or model-generated. The state-of-the-art open-domain QA pipeline, with ColBERT (Santhanam et al., 2022a) as the retriever and DeBERTa (He et al., 2023) as the reader, suffers noticeable performance drops under five different attack modes. Our analyses further show that 1) the misinformation attack is especially effective when fake articles are produced at scale or when specific questions are targeted, and 2) humans do not show an obvious advantage over our BART-FG model in creating more deceiving fake articles.
In summary, we investigate the potential risk to open-domain QA under misinformation. We reveal that QA systems are sensitive to even small amounts of corpus contamination, showing the great potential threat of misinformation to question-answering systems. We end by discussing the necessity of building a misinformation-aware QA system. We publicly release our data and code to help pave the way for follow-up research on protecting open-domain QA models against misinformation¹.
¹ https://github.com/teacherpeterpan/ContraQA/

2 Related Work

Open-domain Question Answering. To answer a question, open-domain QA systems employ a retriever-reader paradigm that first retrieves relevant documents from a large evidence corpus and then predicts an answer conditioned on the retrieved documents. Promising advances have been made toward improving the reader models (Yang et al., 2019; Izacard and Grave, 2021) and neural retrievers (Lee et al., 2019; Guu et al., 2020; Santhanam et al., 2022b). However, since Wikipedia is used as the evidence corpus, previous works take for granted the assumption that the retrieved documents are trustworthy. This assumption becomes questionable with the rapid growth of fake and misleading information in the real world. In this work, we take the initiative to study the potential threat that misinformation poses to QA systems, calling for a new direction of building misinformation-immune QA systems.
Improving Robustness for QA. Our work analyzes vulnerabilities in order to develop more robust QA models. Current QA models are brittle in several respects. They often rely on spurious patterns between the question and context rather than learning the desired behavior: they might ignore the question entirely (Kaushik and Lipton, 2018), focus primarily on the answer type (Mudrakarta et al., 2018), or ignore the "intended" mode of reasoning for the task (Jiang and Bansal, 2019; Niven and Kao, 2019). QA models also generalize poorly to out-of-domain (OOD) data (Kamath et al., 2020); for example, they often make inconsistent predictions for semantically equivalent questions (Gan and Ng, 2019; Ribeiro et al., 2019). Similar to our work, a few prior studies (Chen et al., 2022; Weller et al., 2022; Abdelnabi and Fritz, 2023) investigated the robustness of QA models under conflicting information. For example, Longpre et al. (2021) show that QA models are less robust to OOD data where the contextual information contradicts the learned information. Different from these works, we study QA robustness from a new angle: the vulnerability of QA models under misinformation.
Combating Neural-generated Misinformation. Advanced text-generation models offer a powerful tool for augmenting the training data of downstream NLP applications (Pan et al., 2021; Chen et al., 2023). However, these models also risk being exploited for malicious activities, such as generating convincing fake news (Zellers et al., 2019), fraudulent online reviews (Garbacea et al., 2019; Adelani et al., 2020), and spam. Even humans struggle to detect such synthetically-generated misinformation (Clark et al., 2021). When produced at scale, neural-generated misinformation can threaten many NLP applications. For example, a recent work by Du et al. (2022) finds that synthetic disinformation can significantly affect the behavior of modern fact-checking systems. In this work, we study the risk of neural-generated misinformation to QA models.

Misinformation Documents Generation
We simulate the potential vulnerability of question-answering models to corpus pollution by injecting both human-written and model-generated false documents into the evidence corpus and assessing the impact on the performance of these systems. We base our study on SQuAD 1.1 (Rajpurkar et al., 2016), one of the most popular benchmarks for evaluating QA systems, using all 2,036 unique Wikipedia passages from its validation set. For each Wikipedia passage P_R, we create a set of N fake passages by modifying some information in P_R, with the requirement that each fake passage looks realistic while containing information that contradicts P_R.
We use two different ways to create fake passages: 1) via human edits: we ask online workers from Amazon Mechanical Turk (AMT) to produce fake passages by modifying the original passage; and 2) via BART-FG: our novel generative model, which iteratively masks and regenerates text spans from the original passage to produce fake passages.

3.1 Manual Creation of Fake Passages
To solicit human-written deceptive fake passages, we release 2K HITs (human intelligence tasks) on the AMT platform, where each HIT presents the crowd-worker with one passage P_R from the SQuAD validation set. We ask workers to modify the contents of the given passage to create a fake version, following the guidelines below:
• The worker should make at least M edits at different places, where M equals one plus the number of sentences in the context C_R.
• The worker should make at least one long edit that rewrites at least half of a sentence.
• The edits should modify key information, such as time, location, purpose, outcome, or reason, so that it contradicts the original.
• The modified passage should be fluent and look realistic, without commonsense errors.
To select qualified workers, we restrict our task to workers who are located in five native English-speaking countries² and who maintain an approval rating of at least 90%. To ensure the annotations fulfil our guidelines, we give ample examples in our annotation interface with detailed explanations to help workers understand the requirements. The detailed annotation guideline is in Appendix A. We also hired three computer-science graduate students as human experts to validate each HIT's annotation. In the end, 104 workers participated in the task. The average completion time for one HIT was 5 minutes, the payment was $1.00 per HIT, and the average acceptance rate was 93.75%.

Model Generation of Fake Passages
Aside from human-written misinformation, we also want to explore the threat of machine-generated misinformation to QA. This source may be more of a concern than human-created misinformation, since it can easily be produced at scale. Recently introduced large-scale generative models, such as GPT-2 (Radford et al., 2019), BART (Lewis et al., 2020), and Google T5 (Raffel et al., 2020), can produce realistic-looking texts, but they do not lend themselves to controllable generation that only replaces the key information with contradicting content. Therefore, to evaluate the efficacy of realistic-looking neural fake passages, we propose the BART Fake Passage Generator (BART-FG), which produces both realistic and controlled generated text by iteratively modifying the original passage.

² Australia, Canada, Ireland, United Kingdom, USA

[Figure 2: The iterative mask-and-regenerate process of BART-FG, repeated K times. Example: "The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California." → "The game was played on February 7, 2016, at [MASK] in the San Francisco Bay Area at Santa Clara, California." → "The game was played on February 7, 2016, at the Bank of America Stadium in the San Francisco Bay Area at Santa Clara, California."]

As shown in Figure 2, for each sentence S of the original passage, BART-FG produces its fake version S′ via a two-step process:

1) Span Masking. We first obtain a set of candidate text spans from the input sentence, then randomly select one span and replace it with a special [MASK] token. We employ two different ways to get the candidate spans. 1) NER: we use spaCy³ to extract named entities as the candidate spans. 2) Constituency: we apply the constituency parser implemented in AllenNLP⁴ to extract constituency spans from the input sentence as the candidate spans. We choose to mask named entities / constituency phrases instead of random spans because: 1) they represent complete semantic units such as "Super Bowl 50", which avoids meaningless random phrases such as "Bowl 50"; and 2) they often carry important information in the sentence, such as time, location, or cause.
2) Span Re-generation. We fill in the mask by generating a phrase different from the masked one. The mask is filled by a BART model fine-tuned on the Wikipedia dump with a new self-supervised task called Gap Span Filling, introduced below.
The above pipeline is run iteratively K times to generate sentence S′ from S. We make the edits iteratively rather than in parallel in order to model the interaction between multiple edits. For example, in Figure 2, if a previous edit changes "Santa Clara" to "Atlanta", the next edit can change "California" to "Georgia" to make the contents more consistent and realistic.
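As a rough sketch, the iterative mask-and-regenerate loop can be expressed as follows. The `SWAPS` table, `toy_spans`, and `toy_regenerate` below are illustrative stand-ins only: in the full system, candidate spans come from the spaCy NER or AllenNLP constituency parser, and replacements come from the fine-tuned BART model.

```python
def iterative_fake_edit(sentence, get_spans, regenerate, K=3):
    """Run K rounds of mask-and-regenerate (the BART-FG pipeline, sketched).

    Each round masks one candidate span of the *current* sentence and fills
    it with a contradicting replacement, so later edits can stay consistent
    with earlier ones (e.g. Santa Clara -> Atlanta, then California -> Georgia).
    """
    for _ in range(K):
        spans = get_spans(sentence)
        if not spans:
            break
        start, end = spans[0]            # deterministic pick for this sketch
        replacement = regenerate(sentence, sentence[start:end])
        sentence = sentence[:start] + replacement + sentence[end:]
    return sentence

# Toy stand-ins: swap known entities for contradicting ones.
SWAPS = {"Santa Clara": "Atlanta", "California": "Georgia"}

def toy_spans(s):
    # Character offsets of known entities still present in the sentence.
    return [(s.index(k), s.index(k) + len(k)) for k in SWAPS if k in s]

def toy_regenerate(s, span):
    return SWAPS[span]

fake = iterative_fake_edit("The game was held at Santa Clara, California.",
                           toy_spans, toy_regenerate, K=2)
# -> "The game was held at Atlanta, Georgia."
```

Note how the second round operates on the already-edited sentence, which is what lets consecutive edits remain mutually consistent.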
Gap Span Filling (GSF) Pre-Training. To train the BART model to fill in a masked span, we propose a new pre-training task named Gap Span Filling (GSF). For each article in the Wikipedia dump that consists of T sentences S_1, ..., S_T, where each sentence S_t is a word sequence (w_1, ..., w_n), we construct the following training data for t = 2, ..., T − 1:

Input: S_1 ⊕ S_{t−1} ⊕ mask(S_t, a, b) ⊕ S_{t+1}        Output: (w_a, ..., w_b)

where the output (w_a, ..., w_b) represents a masked constituency or named-entity span that starts with the a-th word and ends with the b-th word of S_t, and mask(S_t, a, b) denotes S_t with that span replaced by [MASK]. The input is the concatenation (⊕) of the first sentence S_1, the previous sentence S_{t−1}, the current sentence S_t with one span masked, and the subsequent sentence S_{t+1}. The BART model is fine-tuned to predict the output given the input on the entire Wikipedia dump. This task trains BART to predict the masked constituency / named-entity span given both global context (S_1) and local context (S_{t−1}, S_{t+1}). We use the facebook/bart-large model provided by Hugging Face (406M parameters).
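The construction of GSF training pairs can be sketched as follows. The sentences and span offsets below are illustrative; in practice the candidate spans would come from an NER or constituency parser run over the Wikipedia dump.

```python
def build_gsf_examples(sentences, spans_per_sentence):
    """Build (input, output) pairs for Gap Span Filling pre-training.

    For each middle sentence, the input concatenates the article's first
    sentence (global context), the previous and next sentences (local
    context), and the current sentence with one candidate span masked;
    the output is the masked span itself. `spans_per_sentence[t]` gives
    the (start, end) character offsets of the span to mask in sentence t
    (0-indexed).
    """
    examples = []
    for t in range(1, len(sentences) - 1):   # t = 2..T-1 in 1-indexed terms
        start, end = spans_per_sentence[t]
        masked = sentences[t][:start] + "[MASK]" + sentences[t][end:]
        inp = " ".join([sentences[0], sentences[t - 1], masked, sentences[t + 1]])
        examples.append((inp, sentences[t][start:end]))
    return examples

sents = ["Super Bowl 50 was an American football game.",
         "The Broncos defeated the Panthers 24-10.",
         "The game was played at Levi's Stadium."]
# Hypothetical candidate span: mask "24-10" in the middle sentence.
examples = build_gsf_examples(sents, {1: (34, 39)})
```

The resulting pair asks the model to recover "24-10" given the surrounding real context, which is the self-supervision signal GSF provides.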

Analysis of the Generated Fake Passages
Table 1 shows examples from six original passages with their corresponding fake versions, which represent six common types of modifications made by humans and the model, explained as follows: (1) Entity Replacement: replacing entities (e.g., person, location, time, number) with other entities of the same type; a common type of modification for both human edits and BART-FG.
(3) Adding Restrictions: creating a contradiction by inserting additional restrictions into the original content, e.g., "every day" → "every day but Sunday". (4) Sentence Rephrasing: rewriting the whole sentence to express a contradicting meaning, exemplified by (4). This is common in human edits but rarely seen in model-generated passages, since it requires deep reading comprehension.
(5) Disrupting Orders: creating a contradiction by disrupting some property of the entities; e.g., example (5) switches the properties of "analysis of algorithms" and "complexity theory". (6) Consecutive Replacements: humans are better at making consecutive edits that create contradicting yet coherent sentences, exemplified by (6).

[Table 1 (reconstructed excerpt): original contexts vs. contradicting contexts.
(1) "The game was played on February 7, 2016 at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California." → "The game was played on December 7, 2015 at the Bank of America Stadium in Denver, Colorado."
(2) "... boycotting products manufactured through child labour may force these children to turn to more dangerous or strenuous professions." → "... boycotting products manufactured through child labour may prevent these children from turning to more dangerous or strenuous professions."
(3) → "Tesla worked every day but Sunday from 9:00 am until 6:00 pm or later."
(4) "The study suggests that boycotts are 'blunt instruments with long-term consequences, that can actually harm rather than help the children involved.'" → "The study did not find any major negative repercussions from boycotts, however, and found that boycotting is the best solution."
(5) "A key distinction between analysis of algorithms and complexity theory is that the former is devoted to ..., whereas the latter asks a more general question of ..." → "A key distinction between analysis of algorithms and complexity theory is that the latter is devoted to ..., whereas the former asks a more general question of ..."]

Corpus Pollution with Misinformation
Given the fake passages curated by both humans and our BART-FG model, we now study how extractive QA models behave with an evidence corpus polluted by misinformation. We begin by creating a clean corpus for question answering that contains one million real Wikipedia passages. We obtain the passages from the 2019/08/01 Wikipedia dump provided by the Knowledge-Intensive Language Tasks (KILT) benchmark (Petroni et al., 2021), in which the Wikipedia articles have been pre-processed and separated into paragraphs. We sample 1M paragraphs from KILT and ensure that all 20,958 Wikipedia passages in the SQuAD dataset are included in the corpus. We then explore the following five ways of polluting the clean corpus with human-created and synthetically-generated false documents.
• Polluted-Human. In Section 3.1, we asked human annotators to create a fake version of each passage in the SQuAD dev set. We inject those 2,023 fake passages into the clean corpus.
• Polluted-NER. We use BART-FG to generate 10 fake passages for each real passage in the SQuAD dev set, using NER to get candidate spans. We mask and re-generate all candidate spans to create each fake passage. Nucleus sampling (Holtzman et al., 2020) is used to ensure diversity in generation, giving us 18,233 non-repetitive fake passages in total, which we inject into the clean corpus.
• Polluted-Constituency. We generate 10 fake passages for each real passage using constituency parsing to get candidate spans in BART-FG. Since a sentence has far more constituency phrases than named entities, to ensure efficiency we fix the number of replacements at K = 3 for each sentence. We get 19,796 non-repetitive fake passages and inject them into the clean corpus.
• Polluted-Hybrid. We inject all of the above-generated fake passages into the clean corpus.
• Polluted-Targeted. In the above settings, the attacker (human or BART-FG) tries to create misleading fake information without knowing the target questions. In another attack mode, however, attackers have particular questions of interest for which they want to mislead the QA system into producing wrong answers. To explore how QA systems react to such attacks, in this setting we assume the attacker targets the questions in the SQuAD dev set. We create fake passages by masking and re-generating the answer spans of these questions using BART-FG. Through this, we get 10,101 fake passages and insert them into the clean corpus.
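The question-targeted attack can be sketched as follows, with a lambda standing in for BART-FG's span re-generation step (the passage and replacement entity here are purely illustrative):

```python
def targeted_fake_passage(passage, answer, regenerate):
    """Sketch of the question-targeted attack: locate the known answer span
    of a target question in a real passage and regenerate it with a
    contradicting value. `regenerate` stands in for BART-FG's span
    re-generation step."""
    start = passage.find(answer)
    if start == -1:
        return None                      # answer span not in this passage
    fake_answer = regenerate(answer)
    return passage[:start] + fake_answer + passage[start + len(answer):]

passage = "The AFC champion Denver Broncos defeated the Panthers 24-10."
fake = targeted_fake_passage(passage, "Denver Broncos",
                             lambda a: "San Francisco 49ers")
# -> "The AFC champion San Francisco 49ers defeated the Panthers 24-10."
```

Because the attacker knows exactly which span the QA system must extract, the fake passage contradicts the gold answer directly, which is why this mode is more damaging than the question-agnostic attacks.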

Models and Experiments
We now examine how question-answering models behave in such a misinformation-polluted environment. To answer a given question, the QA systems employ a retrieve-then-read pipeline that first retrieves N (we set N = 5) relevant context documents from the evidence corpus and then predicts an answer conditioned on the retrieved documents. For document retrieval, we apply the widely-used sparse retrieval based on BM25, implemented with the Pyserini toolkit (Lin et al., 2021). For question answering, we consider five state-of-the-art QA models with public code that achieved strong results on the public SQuAD leaderboard: RoBERTa-large (Liu et al., 2019), SpanBERT (Joshi et al., 2020), Longformer (Beltagy et al., 2020), ELECTRA (Clark et al., 2020), and DeBERTa-V3 (He et al., 2023). We use their model checkpoints fine-tuned on the SQuAD training set from the Hugging Face library, and we measure QA performance with the standard Exact Match (EM) and F1 metrics.
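For reference, the EM and F1 metrics can be computed as in the following sketch, which approximates (but is not identical to) the official SQuAD evaluation script:

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style answer normalization: lowercase, strip punctuation,
    drop articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the Denver Broncos", "Denver Broncos")   # 1.0
f1 = f1_score("Denver Broncos won", "Denver Broncos")      # 0.8
```

The normalization step is what makes "the Denver Broncos" an exact match for "Denver Broncos" despite the leading article.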

Main Results
Table 2 shows the performance of different QA models on the SQuAD dev set under the clean evidence corpus (Clean) and under the misinformation-polluted corpus (Polluted). We have two major observations. For all models, we see a noticeable performance drop when generated fake passages are introduced into the clean evidence corpus: the smallest average performance drop is 7.72% in relative EM (Polluted-Human), while the largest, under Polluted-Targeted, is roughly 53%. This indicates that QA models are sensitive to misinformation attacks; even limited amounts of injected fake passages, comprising 0.2% (Human) to 4.0% (Hybrid) of the entire corpus, can noticeably affect downstream QA performance. This reveals the potential threat of misinformation to current QA systems, given that they are not trained to differentiate misinformation.
Polluted-Targeted causes a more significant performance drop than the most effective question-agnostic attack, Polluted-Hybrid (∼53% vs. ∼22% relative EM drop), indicating that QA models are more vulnerable to question-targeted misinformation attacks. This shows that the misinformation attack poses a greater threat when the attacker wants to alter the answers produced by QA systems for particular questions of interest. For the other four question-agnostic settings, where the pollution does not target specific questions, we still observe a noticeable EM drop (∼20%) for all models. Among them, Polluted-NER causes a larger performance drop than Polluted-Constituency, showing that generating misinformation by replacing named entities is more effective than replacing constituency spans. This is probably due to the nature of the SQuAD dataset, where most answer spans are named entities.

Impact of misinformation on the retriever
The success of the misinformation attack relies on the premise that fake passages can be retrieved from the polluted corpus by the retriever. To validate this, we first define a fake passage P as misleading evidence for the question Q if P contains a fabricated answer for Q. We then report in Table 3 the percentage of misleading evidence in the top-k retrieved passages (F@k, for k ∈ {1, 5}) for the BM25 retriever. We find that both F@1 and F@5 are very high, while the likelihood of the ground-truth true evidence appearing in the top-1 (R@1) and top-5 (R@5) decreases significantly for the polluted corpus. The results show that the injected fake passages can be easily retrieved as evidence for downstream question answering. QA models, lacking fact-checking capability, can thus be easily misled by such misinformation.
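Under our reading of these metrics, F@k and R@k can be computed as follows; the question and passage ids are illustrative:

```python
def retrieval_metrics(retrieved, fake_ids, gold_ids, k):
    """Compute F@k and R@k over a set of questions.

    `retrieved[q]` is the ranked passage-id list for question q.
    F@k = fraction of questions with at least one misleading fake passage
    in the top-k; R@k = fraction with the ground-truth evidence in the
    top-k (our reading of the quantities reported in Table 3).
    """
    n = len(retrieved)
    f_at_k = sum(any(p in fake_ids[q] for p in docs[:k])
                 for q, docs in retrieved.items()) / n
    r_at_k = sum(any(p in gold_ids[q] for p in docs[:k])
                 for q, docs in retrieved.items()) / n
    return f_at_k, r_at_k

# Toy example: two questions, ranked retrieval results.
retrieved = {"q1": ["fake3", "real1", "real7"],
             "q2": ["real2", "real9", "fake5"]}
fake_ids = {"q1": {"fake3"}, "q2": {"fake5"}}
gold_ids = {"q1": {"real1"}, "q2": {"real4"}}
f1k, r1k = retrieval_metrics(retrieved, fake_ids, gold_ids, k=1)
# -> F@1 = 0.5, R@1 = 0.0
```

A high F@k paired with a low R@k is exactly the failure pattern described above: fake passages crowd the gold evidence out of the top ranks.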
However, BM25 relies only on syntactic features and cannot be optimized for specific tasks. Is the misinformation attack also effective against trainable dense retrievers? To explore this, we use ColBERT-V2 (Santhanam et al., 2022a), the state-of-the-art dense retriever that independently encodes the question and the passage using BERT and then employs a late-interaction architecture to model their similarity. We use the ColBERT pretrained on MS MARCO (Nguyen et al., 2016) and fine-tune it with (question, context) pairs from the SQuAD training set as positive samples and (question, random context) pairs as negative samples. The retrieval and QA performance are reported in Table 3. We find that the misinformation attack also affects the ColBERT retriever, decreasing R@1 and R@5 in all settings, with a high percentage of fake passages being retrieved, as reflected by F@1 and F@5. The results also suggest that ColBERT is less resistant to the misinformation attack than BM25. On the clean corpus, ColBERT outperforms BM25 in both retrieval and downstream QA performance; however, on all polluted corpora, the relative performance drop for ColBERT is larger than that for BM25. A possible explanation is that, without the ability to identify fake information, a more "accurate" retriever tends to retrieve more seemingly relevant but false documents, making it less robust to the misinformation attack.

Impact of the size of injected fake passages
As confirmation that misinformation attacks work as expected, Figure 3 depicts how the DeBERTa model's performance changes as different numbers of fake passages are injected into the evidence corpus. We find that the EM score steadily drops with more fake passages for both the question-targeted attack (Targeted) and the question-agnostic attack (NER). However, the former causes a much sharper decrease, which further validates that the misinformation attack is deadlier with better knowledge of the target questions. From this study, we conclude that misinformation may have a more severe impact on QA systems when produced at scale. With the availability of pretrained text-generation models, producing fluent and realistic-looking contexts now has little marginal cost. This brings an urgent need to effectively defend against neural-generated misinformation in question answering.
5.4 Which is more deceiving: human- or model-generated misinformation?
We then investigate which is more deceiving to QA models: human- or model-generated misinformation? To study this, we let the QA model answer each question Q given the context C = {P_R, P_H, P_C, P_N}, where P_R is the real passage that contains the correct answer, and P_H, P_C, P_N are the corresponding fake versions of P_R produced by humans, BART-FG (NER), and BART-FG (Constituency), respectively. We then analyze the source (which fake passage) of the incorrect answer when the model makes an error. If all three methods created equally deceiving fake passages, we would expect a uniform distribution of error sources.

[Figure 4: Distribution of error sources (shown for DeBERTa-V3) when the model is misled by a fake passage and gives a wrong answer.]

The distribution of error sources in Figure 4 shows that most wrong answers are extracted from the model-generated fake passages. Human-created fake passages do not show an advantage over BART-FG in deceiving the QA models. This is counter-intuitive to what we find in Table 1, namely that humans make more subtle edits requiring a deep level of reading comprehension, such as switching "former" and "latter" (Example 5) and changing "every day" to "every day but Sunday" (Example 3). A possible reason is that most questions in SQuAD are shallow in reasoning (Du et al., 2017). Therefore, replacing named entities / constituency phrases is sufficient to mislead QA models into wrong answers for those questions.

Can misinformation deceive humans?
After showing the impact of misinformation attacks on QA systems, a natural question is whether humans can also be distracted by misinformation during QA. To investigate this, we ran a study on Mechanical Turk in which we presented crowd-workers with 500 randomly-sampled (question, context) pairs from the data in Section 5.4, i.e., each context consists of the real passage along with three fake passages created by the different methods. We call this test set MisinfoQA-noisy, and the workers are asked to answer each of its questions. For comparison, we create another test set, MisinfoQA-clean, where each real passage is paired with three other randomly sampled Wikipedia passages; this controls for errors caused by the mere presence of additional contexts rather than by the fake contexts. Humans obtained an EM of 69.13 on MisinfoQA-noisy, which, though higher than most QA models' performance, shows a significant drop compared to the MisinfoQA-clean setting (86.57 EM). This suggests that humans are also distracted by misinformation in QA, which demonstrates the challenge of distinguishing misinformation for lay readers, the quality of the generated fake passages, and the difficulty of detecting such an attack.
6 Discussion and Future Work

Finally, we discuss three possible ways to defend against the threat of misinformation to QA.
Knowledge source engineering. Despite being a trusted knowledge source, Wikipedia is insufficient to fulfill all the information needs of real-life question answering. Therefore, recent works (Piktus et al., 2021) have started to use the web as the QA corpus. However, when transitioning to a web corpus, we no longer have the certainty that any document is truthful, so such corpora will require more careful curation to avoid misinformation. This also creates the need for future retrieval models to assess the quality of retrieved documents and prioritize more trustworthy sources.
Integrating fact-checking and QA. With the rise of misinformation online, automated fact-checking has received growing attention in NLP (Guo et al., 2022). Integrating fact-checking models into the open-domain QA pipeline could be an effective countermeasure to misinformation, a direction neglected by prior works. One possible approach is to detect potentially false claims in retrieved contexts and lower their importance in downstream QA models.
Reasoning under contradicting contexts. Humans commonly deal with contradictory information during information search. Given the presence of inaccurate and false information online, future models should focus on the ability to synthesize and reason over contradicting information to derive correct answers.

Conclusion
In this work, we evaluated the robustness of open-domain question-answering models when the evidence corpus is contaminated with misinformation. We studied two representative sources of misinformation: human-written disinformation and misinformation generated by NLG models. Our studies reveal that QA models are indeed vulnerable under misinformation-polluted contexts. We also show that our BART-FG model can produce fake documents at scale that are as deceptive as human-written ones, which poses a threat to current open-domain QA models in defending against neural misinformation attacks.

Limitations
We identify two main limitations of our study. First, although SQuAD is a typical dataset for evaluating open-domain QA models, most SQuAD questions are factoid and shallow in reasoning, making it relatively easy to generate misinformation targeted at SQuAD. Our results show that BART-FG with named entity replacement can generate fake passages that are as deceptive as human-written ones. However, the impact of model-generated misinformation may be overestimated on the shallow factoid questions in SQuAD. Therefore, more QA datasets should be considered in future work, especially those with non-factoid questions requiring deeper reasoning. Second, this work creates misinformation by revising key information in real Wikipedia articles. However, other types of misinformation exist in the real world, such as hoaxes, rumors, and false propaganda. Nevertheless, our proposed attack model can be easily generalized to study the threat of misinformation in other domains and in other forms.

Ethics Statement
We plan to publicly release the human-written and model-generated fake documents and to open-source the code and model weights for our BART-FG model. We note that open-sourcing the BART-FG model carries the potential for deliberate misuse to generate disinformation for harmful applications. The human-written and model-generated fake documents could likewise be misused to spread disinformation. We deliberated carefully on this decision and share here our three reasons for publicly releasing our work.
First, the danger of BART-FG for generating disinformation is limited. Disinformation is a subset of misinformation that is spread deliberately to deceive. Although we utilize the innate "hallucination" ability of current pretrained language models to create misinformation, our model is not specialized for generating harmful disinformation such as hoaxes, rumors, or false propaganda. Instead, it focuses on generating conflicting information by iteratively editing the original passage to test the robustness of QA models to misinformation.
Second, our model is based on the open-source BART model, which makes it easy to replicate even without our released code. Given that our model is a revised version of an existing publicly available model, it is unnecessary to conceal the code or model weights.
Third, our decision to release follows the stance behind the full release of Grover (Zellers et al., 2019), a strong detector and state-of-the-art generator of neural fake news.5 The authors argue that defending against potential threats requires threat modeling, in which a crucial component is a strong generator or simulator of the threat. In our work, we build an effective threat model for QA under misinformation. Follow-up research can build on our model's transparency to further enhance this threat model.

Figure 1 :
Figure 1: Our framework injects human-created and model-generated misinformation documents into the QA evidence repository (left) and evaluates the impact on the performance of open-domain QA systems (right).

Figure 2 :
Figure 2: Overview of the BART-FG model, illustrated by an example sentence.

Figure 3 :
Figure 3: The EM score for the DeBERTa-V3 model with different numbers of injected fake passages N.

Table 1: Examples of original passages and their corresponding fake versions, where the information changes are highlighted. These examples represent six common types of created misinformation. For example (type 6):
Original: "On the whole, Eisenhower's support of the nation's fledgling space program was officially modest until the Soviet launch of Sputnik in 1957, gaining the Cold War enemy enormous prestige around the world."
Fake: "On the whole, Eisenhower's support of the nation's fledgling MK Ultra was officially terminated until the Cuban missile crisis, gaining the Cold War enemy enormous admiration in less developed nations."

Table 2 :
Effects of different modes of misinformation attacks on the open-domain QA performance in SQuAD.

Table 3 :
Effects of different modes of misinformation attacks on the BM25 and ColBERT-V2 retrievers.

Table 4: QA performance under the reading comprehension setting with clean and noisy contexts. The table reports EM and F1 for humans and for different QA models. We find that all QA models suffer a large performance drop (∼20% in EM) on MisinfoQA-noisy compared to MisinfoQA-clean, showing that the models are largely distracted by the injected misinformation.