FiD-Ex: Improving Sequence-to-Sequence Models for Extractive Rationale Generation

Natural language (NL) explanations of model predictions are gaining popularity as a means to understand and verify decisions made by large black-box pre-trained models, for tasks such as Question Answering (QA) and Fact Verification. Recently, pre-trained sequence-to-sequence (seq2seq) models have proven very effective at jointly making predictions and generating NL explanations. However, these models have notable shortcomings: they can fabricate explanations even for incorrect predictions, they are difficult to adapt to long input documents, and their training requires a large amount of labeled data. In this paper, we develop FiD-Ex, which addresses these shortcomings for seq2seq models by: 1) introducing sentence markers to eliminate explanation fabrication by encouraging extractive generation, 2) using the fusion-in-decoder architecture to handle long input contexts, and 3) intermediate fine-tuning on re-structured open-domain QA datasets to improve few-shot performance. FiD-Ex significantly improves over prior work in terms of explanation metrics and task accuracy on five tasks from the ERASER explainability benchmark, in both fully supervised and few-shot settings.


Introduction
While large pre-trained language models (Devlin et al., 2019; Raffel et al., 2019; Lewis et al., 2020) with hundreds of millions of parameters have made super-human performance possible on various NLP datasets, they lack transparency into their decision-making process, which can adversely affect user trust in their predictions. Recent works have proposed the use of natural language (NL) rationales (Lei et al., 2016; DeYoung et al., 2020; Latcinnik and Berant, 2020) as a means to either obtain an understanding of the reasoning process of models, or as a human-readable snippet for users to verify predictions (Lipton, 2018). Figure 1 presents examples of extractive textual rationales for two QA tasks from the ERASER benchmark (DeYoung et al., 2020). Recently, Narang et al. (2020) show that sequence-to-sequence (seq2seq) models outperform previous methods at generating textual rationales for various explainability benchmarks. However, seq2seq models can fabricate rationales even for wrong predictions, are hard to scale to datasets involving several long evidence documents, and require large amounts of expensive rationale-annotated data for training. In this paper, we introduce FiD-Ex to alleviate these problems and enhance seq2seq models to achieve significant gains in rationale generation performance. Camburu et al. (2020) find that models that generate free-form NL explanations can tailor them to convincingly justify incorrect model predictions, for example, generating "There is no dog in the image" to justify a "no" prediction on the image of a dog.

Figure 1: Example questions, answers and corresponding passages from the BoolQ and MultiRC datasets from the ERASER benchmark (DeYoung et al., 2020). Annotated rationales are highlighted. Note that rationales can be multi-sentence and non-contiguous.
Although recent seq2seq models (Narang et al., 2020) obtain state-of-the-art performance on rationale generation benchmarks, they are vulnerable to similar behaviour and can hallucinate new facts by tapping into the world knowledge stored in the language model parameters. To retain their effectiveness while alleviating the problem of explanation fabrication, FiD-Ex introduces the novel use of sentence markers in pre-trained seq2seq models. Training seq2seq models to decode sentence marker tokens instead of explanation tokens not only guarantees the production of unaltered rationales but also significantly improves explanation metrics on five datasets (Section 7).
Fine-tuning pre-trained models on data-rich intermediate tasks before fine-tuning on classification end tasks has recently been shown to improve end-task performance (Vu et al., 2020; Pruksachatkun et al., 2020), more so in the few-shot setting. We find that this method also extends to seq2seq models for explanation generation. We fine-tune pre-trained seq2seq models to extract supporting evidence for existing open-domain QA datasets such as Natural Questions (Kwiatkowski et al., 2019) and HotpotQA (Yang et al., 2018), which then improves downstream performance on rationale extraction benchmarks. This approach is motivated by the similarity between gathering supporting facts for QA and extracting rationales for classification tasks. While earlier works on rationale generation (Paranjape et al., 2020; Narang et al., 2020) are limited by the input passage size of pre-trained models and resort to input-passage truncation, FiD-Ex uses the Fusion-in-Decoder (FiD) approach (Izacard and Grave, 2020), which separately encodes chunks of long passages and fuses them in the decoder, further improving performance. We combine the methods described above to develop FiD-Ex (Extractive Fusion-in-Decoder). To summarize, FiD-Ex significantly improves the performance and trustworthiness of seq2seq models for rationale generation by 1) reducing their ability to fabricate explanations using sentence markers, 2) extending them to very long input passages, and 3) intermediate fine-tuning on re-structured existing QA datasets. When applied to the ERASER datasets (DeYoung et al., 2020), a popular benchmark for rationale extraction, FiD-Ex yields significant gains on multiple tasks in terms of explanation metrics: an absolute token-F1 gain of 12.7% on Boolean Question Answering (BoolQ), 33.2% on MovieReviews, 5.3% on Evidence Inference, 2.8% on FEVER, and 2.1% on MultiRC, along with modest gains in task accuracy over prior work.

Related Work
Deep learning models typically function as black boxes, offering very little insight into their decision-making mechanics. To expose model understanding at various depths, researchers have proposed various structural probing (Tenney et al., 2018; Hewitt and Manning, 2019; Lin et al., 2019) and behavioral probing methods (McCoy et al., 2020; Goldberg, 2019; Warstadt et al., 2019; Ettinger, 2020), as well as input saliency maps to highlight the most important tokens/sentences in the input for each prediction (Serrano and Smith, 2019; Ribeiro et al., 2016; Swanson et al., 2020; Tenney et al., 2019), and input token relationships (Lamm et al., 2020). Alongside, there is work on producing textual rationales (Lei et al., 2016), which are snippets of NL that help explain model predictions. Models may take a pipelined approach, where rationales are first selected as the sole inputs to the prediction stage, either in a supervised (Lehman et al., 2019; Pruthi et al., 2020) or an unsupervised (Paranjape et al., 2020; Bastings et al., 2019; Jain et al., 2020) fashion. Alternatively, rationales can also serve as post-hoc supporting evidence, produced after the model prediction, as a snippet to help users verify the prediction (Yang et al., 2018; Thorne et al., 2018). In this work, we improve upon seq2seq models to produce the latter kind of NL explanations, along with model predictions.
In addition to extractive NL rationales obtained from subsequences of the input text, there is recent work on generating abstractive textual explanations for NLP tasks such as commonsense QA (Rajani et al., 2019) and NLI (Camburu et al., 2018; Kumar and Talukdar, 2020). Latcinnik and Berant (2020) train language models to transparently output their world knowledge as NL tokens, which is then consumed by a light-weight classifier. Narang et al. (2020) use a generative seq2seq T5 model to produce NL explanations token-by-token for the extractive ERASER benchmark, in order to take advantage of multi-task training, i.e., training for task prediction alone, or jointly with explanations if available. Unlike strict input-attribution-based methods that seldom produce human-readable explanations, these models can provide users with more context, in keeping with the style of explanation annotations in standard benchmarks such as ERASER. However, such models are susceptible to fabricating explanations to justify even their incorrect predictions, as identified by Camburu et al. (2020) and Wiegreffe et al. (2020). We introduce sentence markers into seq2seq models, which alleviates this problem and also significantly improves their rationale extraction performance on sentence-level ERASER benchmark tasks (see Section 4.2).

Figure 2: Fusion-in-Decoder architecture for rationale prediction. Each sentence from the passage is marked with sentence markers S1 ... SN. The passage is broken up into C contexts/chunks, which are passed to the encoder. The decoder then attends to the C concatenated and encoded passages to generate the output sequence. The output sequence is the classification token followed by rationale sentence markers.
Multiple prior works (Paranjape et al., 2020; Jain et al., 2020; Narang et al., 2020) have explored methods to improve few-shot rationale generation, to reduce reliance on expensive rationale annotations. We fine-tune FiD-Ex on re-structured intermediate QA datasets to improve its regular and few-shot performance for rationale extraction. Fine-tuning large pre-trained models on intermediate tasks has been shown to be effective by prior work: Phang et al. (2018) use data-rich intermediate NLI tasks to improve target classification tasks; Talmor and Berant (2019) fine-tune on multiple QA datasets to improve the generalizability of QA models. Intermediate fine-tuning (IFT) can also hurt performance (Bingel and Søgaard, 2017). Pruksachatkun et al. (2020) recently present a large-scale study fine-tuning a pre-trained RoBERTa model on 100 intermediate-target task combinations, using 25 probing tasks to understand the most desirable properties of intermediate tasks and datasets. Vu et al. (2020) explore transferability between 33 NLP tasks and, using task embeddings to predict the utility of intermediate tasks, conclude that intermediate tasks requiring high levels of reasoning and inference ability are more likely to help, particularly when task data is scarce. Closest to our method is Kung et al. (2020), who use SQuAD 2.0 as an intermediate task to fine-tune a shared encoder fitted with task-specific classification heads, for the downstream BeerReview and MovieReview rationalization tasks. Our approach is to strategically re-structure large open-domain QA datasets (Natural Questions and HotpotQA) to make them amenable to IFT of both the encoder and the decoder of pre-trained seq2seq models. This enables the use of exactly the same model architecture for multiple rationale prediction tasks.

Modeling
In this section, we develop FiD-Ex, which improves upon seq2seq approaches to jointly produce NL rationales along with model predictions. We illustrate our method using the BoolQ dataset from the ERASER explainability benchmark, which comprises questions with passages and boolean answers (see Figure 1), together with human-annotated rationales (details in Section 4).
Formally, given an input query q and an input passage p comprising sentences p = {s_j}_{j=1}^N, our goal is to produce a prediction y and rationale sentences {e_k}_{k=1}^K, e_k ∈ p, K ≪ N, that justify y. Narang et al. (2020) fine-tune the pre-trained T5 (Text-to-Text Transfer Transformer) model (Raffel et al., 2019) to auto-regressively produce the prediction and the explanation in a token-by-token fashion. Specifically, their model takes an input of the form "explain {task-name}: q p", represented as a sequence of subword units (Sennrich et al., 2016) using SentencePiece (Kudo and Richardson, 2018), and is trained to auto-regressively maximize the likelihood of an output sequence represented as "{prediction} explanation: e_1 ... explanation: e_K". For example, an input from the BoolQ dataset (Clark et al., 2019) might be represented as "explain boolq: Is Sanskrit the first language of the world? <passage-tokens>", with the output represented as "False explanation: Sanskrit belongs to the Indo-European family of languages. explanation: It is one of the three ..." Such a model can be trained on data both with and without explanation annotations, by dropping the unavailable parts of the output sequence. This model achieves state-of-the-art explanation performance on several ERASER tasks and serves as a strong baseline that we build upon. Narang et al. (2020), as well as other works (Camburu et al., 2020), point out that seq2seq models can fabricate reasonable-sounding rationales to justify their incorrect predictions. To alleviate this issue, we introduce sentence markers into the input and output to enable the model to learn to generate a rationale sentence as a single unit. This technique has the added benefit that the rationales produced by the model are guaranteed to be strictly extractive at the sentence level, while retaining the performance benefits of a seq2seq architecture.
Specifically, we preprocess the input passage p by prefixing each sentence s_i with a sentence marker token S{i}. We also train the decoder to output the special sentence marker tokens instead of NL tokens. Thus, the input is represented as "question: q passage: S1 s_1 S2 s_2 ... SN s_N" and the output as "False explanation: S_{e_1} ... explanation: S_{e_K}", where S_{e_K} is the marker for e_K. The example from BoolQ would be represented as "explain boolq question: Is Sanskrit the first language of the world passage: S1 <Sent-1> ... SN <Sent-N>" and the output as "False explanation: S2 explanation: S3". Note that these markers are injected as NL text and are later split into subword units. During inference, sentence markers are produced and mapped back to the corresponding sentences from the input.
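For concreteness, the marker scheme can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; the exact prompt strings and helper names are our assumptions.

```python
import re

def build_marked_example(question, sentences, label, rationale_idxs, task="boolq"):
    """Build a (source, target) string pair with sentence markers S1..SN.
    The target contains marker tokens instead of rationale text, which
    keeps generation strictly extractive at the sentence level."""
    passage = " ".join(f"S{i} {s}" for i, s in enumerate(sentences, start=1))
    source = f"explain {task} question: {question} passage: {passage}"
    target = f"{label} " + " ".join(f"explanation: S{i}" for i in rationale_idxs)
    return source, target

def markers_to_sentences(decoded, sentences):
    """Map predicted sentence markers back to input sentences at inference,
    skipping markers that do not exist in the input (an observed error mode)."""
    idxs = [int(m) for m in re.findall(r"explanation: S(\d+)", decoded)]
    return [sentences[i - 1] for i in idxs if 1 <= i <= len(sentences)]
```

Because the decoder emits only marker tokens, the recovered rationales are guaranteed to be verbatim input sentences.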

Fusion-in-Decoder Approach
Current approaches typically truncate p to 512 or 1,024 tokens, which is particularly limiting for datasets such as BoolQ, whose input passages are very long (> 3,000 tokens). To accommodate longer input passages, both for intermediate fine-tuning (see Section 3.3) and target fine-tuning, we use the Fusion-in-Decoder (FiD) architecture of Izacard and Grave (2020) as a replacement for the single encoder-decoder model of Narang et al. (2020). Using FiD, we break p into smaller chunks and encode each chunk independently using the pre-trained T5 encoder (see Figure 2). This expands the effective input length of the encoder, while keeping computation growing linearly with the number of passages as opposed to quadratically. These separately encoded representations are then fused in the decoder, which attends to all passage tokens when producing output tokens. For encoding, we concatenate the query q with each chunk of the input passage p. Further, we prefix query and context tokens with the special tokens "question:" and "passage:" respectively. Making use of additional context from the passage, without truncation, significantly improves performance on the intermediate fine-tuning tasks as well as on the BoolQ, Movie Reviews, and Evidence Inference end tasks (see Table 2). When using sentence markers, they are added to the passage before it is subdivided into chunks.
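The chunking step can be sketched as below, working over token lists for clarity. This is a simplified illustration under our own assumptions (names, prefix strings, and the list-of-tokens representation); the actual model operates on subword ids and encodes each context with a shared T5 encoder.

```python
def make_fid_contexts(question, passage_tokens, max_contexts=10, chunk_len=512):
    """Split a long passage into fixed-size chunks and pair each chunk
    with the query, as in Fusion-in-Decoder. Each resulting context is
    encoded independently; the decoder then attends over the
    concatenation of all encoded contexts."""
    q_prefix = ["question:"] + question
    chunks = [passage_tokens[i:i + chunk_len]
              for i in range(0, len(passage_tokens), chunk_len)]
    return [q_prefix + ["passage:"] + chunk for chunk in chunks[:max_contexts]]
```

Because self-attention runs within each chunk, encoder cost grows linearly in the number of chunks rather than quadratically in the total passage length.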

Intermediate Fine-tuning (IFT)
Since obtaining rationale annotations for datasets is expensive, we look to fine-tune on existing large datasets to improve target-task performance, particularly in the few-shot setting. Specifically, we re-structure open-domain QA (ODQA) datasets with answer-span annotations to follow the same input-output structure as our target tasks, i.e., we produce a dataset of (query q, passage p, prediction y, extractive rationales e) tuples from existing ODQA datasets. The datasets, together with their specific re-structuring methods, are described in Section 4. We present experiments where we first fine-tune FiD-Ex on a combination of multiple ODQA datasets, and finally fine-tune on our target evaluation task, in Section 7. Alternatively, when multiple annotated datasets are available, we can train a single universal model on the combined datasets that works for all evaluation tasks. We explore this in Section 7.2.

Datasets
In this section, we discuss the open-domain QA datasets and our pre-processing steps to prepare them for IFT, as well as the ERASER rationale datasets that we use for evaluation. Table 1 presents the sizes of each dataset split, as well as the average input passage lengths, in terms of the number of tokens and sentences, for both types of datasets.

Intermediate Fine-Tuning Datasets
Natural Questions (NQ) (Kwiatkowski et al., 2019) comprises real Google search queries with answer-span annotations from Wikipedia pages. Following Lee et al. (2019), we use a subset containing short answers (< 6 tokens). For every question and answer-span annotation, we use the question as q, the segmented Wikipedia passage as p, the answer tokens as the prediction y, and the single sentence containing the answer span as the rationale e. We remove all tables and lists from the Wikipedia passages, but retain section headers.
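The NQ re-structuring described above can be sketched as follows. This is an illustrative assumption of the procedure (function name, dict layout, and substring-based span matching are ours), not the authors' code:

```python
def nq_to_rationale_example(question, sentences, answer):
    """Re-structure one NQ example into the target (q, p, y, e) format:
    the single sentence containing the short answer becomes the rationale.
    Returns None if no sentence contains the answer span."""
    for sent in sentences:
        if answer in sent:  # simplified span match for illustration
            return {"q": question, "p": sentences, "y": answer, "e": [sent]}
    return None
```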
HotpotQA (Yang et al., 2018) is a multi-hop QA dataset, where each question and answer annotation is accompanied by supporting-fact sentence annotations from multiple Wikipedia documents. Similar to NQ, we use the question as q and the answer tokens as the prediction y. Since there are multiple Wikipedia evidence pages, we treat each page as a separate passage p and aggregate the annotated rationale sentences from it as the rationales e. Thus, a single HotpotQA (question, answer) tuple produces as many examples as there are Wikipedia pages among its supporting facts.
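A sketch of the HotpotQA expansion, one training example per supporting page. The data layout (`supporting_pages` mapping page title to sentences and supporting-sentence indices) is an assumption made for illustration:

```python
def hotpot_to_examples(question, answer, supporting_pages):
    """Expand one HotpotQA (question, answer) tuple into one (q, p, y, e)
    example per supporting Wikipedia page, aggregating the annotated
    supporting-fact sentences of that page as rationales."""
    examples = []
    for title, (sentences, fact_idxs) in supporting_pages.items():
        rationales = [sentences[i] for i in fact_idxs]
        examples.append({"q": question, "p": sentences,
                         "y": answer, "e": rationales})
    return examples
```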

Evaluation Data
We evaluate on a subset of the datasets from the ERASER benchmark (DeYoung et al., 2020), which comprise an input query and passage, an output class label, and input sentences annotated as rationales. We discuss these datasets in this section.
BoolQ (Clark et al., 2019) comprises questions, whose answer can be either True or False, paired with long Wikipedia passages (> 3,000 tokens), as well as sentence-level rationale annotations (provided by ERASER) that support the answer.
MultiRC (Khashabi et al., 2018) comprises input passages and questions with multiple-choice answers, along with sentence-level rationale annotations. It is evaluated as a Boolean QA task by concatenating each answer choice to the question and assigning a True label to correct choices and False to the rest. All choices use the same set of supporting facts.
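This casting to Boolean QA can be sketched as below; the " || " separator and string labels are illustrative assumptions, not the benchmark's exact serialization:

```python
def multirc_to_boolean(question, choices, correct_choices):
    """Cast one MultiRC question as Boolean QA: append each answer
    choice to the question and label it True if the choice is correct,
    False otherwise. Every derived example shares the same rationales."""
    return [(f"{question} || {choice}",
             "True" if choice in correct_choices else "False")
            for choice in choices]
```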
MovieReviews (Movies) (Zaidan and Eisner, 2008; Pang and Lee, 2004) contains movie reviews paired with binary positive/negative labels, without a query q (we set it to "What is the sentiment of this review?" in our models). While ERASER provides span-level rationale annotations, we translate these to sentence-level annotations following prior work (Paranjape et al., 2020). FiD-Ex can also potentially be trained to output extracted input phrase markers, and we leave this to future work.
FEVER (Thorne et al., 2018): the ERASER version of FEVER contains input passages along with claims (q) that must be classified as supported or refuted based on the passage, together with sentence-level rationale annotations from the input passage.
Evidence Inference (EVI) (Lehman et al., 2019) comprises (intervention, outcome, comparator) triples (concatenated as q) together with randomized controlled trial articles (> 4,000 tokens), with the prediction being whether the intervention significantly increases, decreases, or has no effect on the outcome with respect to the comparator of interest. ERASER provides sentence-level supporting facts on a subset of this dataset.
We do not evaluate on the ERASER datasets of e-SNLI and CoS-E since they only use single-sentence input passages.

Evaluation Metrics
We report Exact Match Accuracy (EM) in terms of exact token match between the predicted class label and the true label, which is equivalent to traditional classification accuracy. To evaluate explanation quality, we report the following:
Rationale F1 (RF1) is an F1 score over the set of predicted explanation sentences as compared to the set of gold explanation sentences, computing set intersection based on exact sentence match.
Token F1 (TF1) is a token level F1 score between the predicted explanation sentence tokens and the gold explanation sentence tokens, in terms of sets of token positions, by first mapping tokens to token positions in the input passage. This is computed exactly as in Narang et al. (2020), using spaCy for tokenization. When using sentence markers, we map the markers back to the original sentences before computing TF1.
Intersection over Union (IOU F1) as described in DeYoung et al. (2020), is computed by first matching up each predicted rationale with a gold rationale, and then computing F1. IOU is similar to RF1, except that it does not use exact match. A prediction and gold sentence match if the size of the overlap of their token positions divided by the size of the union of the token positions is higher than a threshold (we use 0.5). For our models, IOU F1 is very similar in magnitude to RF1.
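The three explanation metrics can be sketched as below. This is our own minimal restatement of the definitions above (function names are ours); the reported numbers use the reference implementations of Narang et al. (2020) and DeYoung et al. (2020).

```python
def rationale_f1(pred_sents, gold_sents):
    """RF1: set-based F1 over exact sentence matches."""
    pred, gold = set(pred_sents), set(gold_sents)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if not tp:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def token_f1(pred_positions, gold_positions):
    """TF1: F1 over sets of input token positions covered by the
    predicted vs. gold explanation sentences."""
    pred, gold = set(pred_positions), set(gold_positions)
    tp = len(pred & gold)
    if not tp:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def iou_match(pred_positions, gold_positions, threshold=0.5):
    """For IOU F1: a predicted/gold sentence pair counts as a match if
    the intersection-over-union of their token positions meets the
    threshold (0.5 here), relaxing RF1's exact-match requirement."""
    pred, gold = set(pred_positions), set(gold_positions)
    return len(pred & gold) / len(pred | gold) >= threshold
```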
Other Metrics We do not use human evaluation scores since Narang et al. (2020) found them to be much higher than the automated metrics, and therefore, hard to interpret, in addition to being expensive and noisy. Also, since we aim to provide users with evidence for model predictions, causal faithfulness metrics such as comprehensiveness and sufficiency (DeYoung et al., 2020), do not apply.

Implementation Details
We use the FiD (Izacard and Grave, 2020) model architecture with T5-base (220M parameters). We use 1,024 input subword tokens per context for MultiRC and 512 for the rest. We use a maximum context size of 10 for BoolQ and EVI, and 6 for Movies. We use data-distributed training on machines with 8 32-GB GPUs and a batch size of 8 per GPU. We train all models for 20,000 steps using Adam (Kingma and Ba, 2014), with learning rates chosen from {1e-4, 1e-5} based on dev performance, with linear decay. We compute dev metrics every 500 steps and select the model with the best TF1 score. We use greedy decoding for both the prediction and the explanation. The above settings are used both for IFT and for end-task fine-tuning. For segmenting Wikipedia passages into sentences for NQ, we use the English Punkt tokenizer (Kiss and Strunk, 2006) from nltk. For our evaluation datasets, we use the pre-segmented and pre-tokenized input passages provided by ERASER.

Results and Discussion
We compare the performance of different variants of our FiD-Ex model using all evaluation metrics on five ERASER datasets, in Table 2.
Increased Passage Size Using FiD's multiple context encoders instead of the input-truncation methods of prior work significantly improves performance. When also using sentence markers, BoolQ TF1 improves by 7.2%, Movies by 8.4%, and EVI by 21.1%. This is accompanied by task EM gains of 8.5% on Movies and 12.1% on EVI. Input passages in MultiRC and FEVER are not long enough to benefit significantly from increased passage size. The gains from increasing passage size are orthogonal to the gains from sentence markers, i.e., explanation metrics improve with additional context with or without sentence markers (Table 2). Similarly, sentence markers improve performance for both single and multiple contexts.
Intermediate Fine-tuning (IFT) and Few-shot Performance We perform IFT using sentence markers on a combined dataset of NQ and HotpotQA, re-formatted for rationale extraction tasks. Final fine-tuning on the full training sets of our evaluation tasks improves TF1 by 1.4% for BoolQ and 1.2% for EVI. To evaluate IFT in the few-shot setting, we fine-tune using 25% of the data for the BoolQ, Movies, and EVI tasks, following Paranjape et al. (2020), and 2,000 examples for the tasks with bigger datasets, viz., MultiRC and FEVER. We see an improvement of 3.2% TF1 on BoolQ and 1.2% on EVI. This is desirable since obtaining labeled rationale annotations is expensive. We do not observe any performance improvement for Movies, MultiRC, and FEVER with IFT. While our few-shot experiments used 25% of the data to compare with prior work, IFT may show more marked improvements with just 10-100 examples. While IFT on NQ or HotpotQA alone improves performance, we find that combining the datasets yields the best results.

Comparison with Prior Work
In Table 3, we compare our best fully supervised model for each dataset with prior works that report the best performance on ERASER tasks. Bert-to-Bert (B2B) is the supervised pipeline of DeYoung et al. (2020) that comprises an independently trained rationale extractor and an answer prediction model on the extracted rationales. We also compare with the information bottleneck approach of Paranjape et al. (2020), which jointly trains an explainer that predicts sparse binary masks over input sentences, and a prediction model on the residual sentences. Although they only report supervised results using 25% of the training data, their model achieves similar performance even with 100% training data.

Universal Model
With the goal of deploying one single model that can perform all 5 ERASER tasks, we train a model on their combined training sets, with SM and C = 10, and evaluate on each test set (see Table 1). Each training example is prefixed with a token denoting the dataset it came from, as described in Section 3. Despite the lack of individual fine-tuning, this universal model outperforms the best fine-tuned models by 4% on FEVER and is within ±1% of the best model performance on the other datasets. Training on a large combined dataset of related tasks, when available, reduces reliance on IFT to improve performance (which primarily benefits only EVI in this scenario). Overall, this result highlights a key advantage of the seq2seq format, which naturally enables effective data sharing among multiple related tasks (Raffel et al., 2019).

Error Analysis
We conduct an error analysis on predictions from our best FiD-Ex model on 50 random examples from the validation set of BoolQ that have an imperfect RF1 score (Table 4). The two largest error types are: 1) Overlap and Adequate (36%): the set of predicted explanations is adequate by itself and overlaps with the true explanations, i.e., the true explanation set contains redundancies; and 2) Over-prediction (30%): the set of predictions is a strict superset of the true explanations. Other sources of error are Overlap and Inadequate (4%), when the predictions are inadequate but overlap with the true explanations; No-overlap and Adequate/Inadequate, when the predictions have no overlap with the true explanations and are either still adequate (8%) or inadequate (12%) (since ERASER provides only one of the multiple possible explanation sets, 8% of non-overlapping predictions happen to be adequate); Prediction not in input (4%), when sentence markers that do not exist in the input are predicted; and Input Truncated (6%), when the true explanation sentences are truncated out of the model input, which still happens for very long inputs even with a context size of 10. We present illustrative examples of these error cases in the Appendix. Promising focus areas for future work include addressing model tendencies for over-prediction (30% of cases) and inadequate non-overlapping predictions (12% of cases).

Conclusion
In this paper, we develop general methods to improve the performance of large pre-trained seq2seq models for jointly producing NL rationales and answer predictions. Specifically, we introduce sentence markers into seq2seq models to tackle explanation fabrication, we enable larger input passage sizes using the Fusion-in-Decoder architecture, and we infuse knowledge by fine-tuning on restructured QA datasets. We show that a universal model can perform favourably compared to the best task-specific fine-tuned models. Our methods improve the state of the art on rationale extraction metrics and task accuracy on multiple ERASER benchmarks while reducing the extent to which seq2seq models fabricate explanations to justify incorrect predictions, thereby improving the reliability and verifiability of the generated rationales.

Overlap and Inadequate
Question: are all mass air flow sensors the same
Gold Answer: False Predicted Answer: False
Gold Rationales: ['S4', 'S5', 'S6', 'S7', 'S8'] Predicted Rationales: ['S0', 'S1', 'S2', 'S3', 'S4']
Document: 'S0 MASS FLOW SENSOR S1 A mass ( air ) flow sensor ( MAF ) is a sensor used to determine the mass flow rate of air entering a fuel -injected internal combustion engine . S2 The air mass information is necessary for the engine control unit ( ECU ) to balance and deliver the correct fuel mass to the engine . S3 Air changes its density with temperature and pressure . S4 In automotive applications , air density varies with the ambient temperature , altitude and the use of forced induction , which means that mass flow sensors are more appropriate than volumetric flow sensors for determining the quantity of intake air in each cylinder . S5 There are two common types of mass airflow sensors in use on automotive engines . S6 These are the vane meter and the hot wire . S7 Neither design employs technology that measures air mass directly . S8 However , with additional sensors and inputs , an engine 's ECU can determine the mass flow rate of intake air . . . . S103 REFERENCES EXTERNAL LINKS * A Hot Film sensor with theory of operation * A video example of cleaning a MAF sensor * An example of how to clean a MAF sensor , S104 3 wire S105 * How To Test a MAF'

Table 5: Fraction of explanation sentences produced by the WT5 model (Narang et al., 2020) that do not exactly match any input sentence (with C=1), on five ERASER datasets.
In Table 6, we present an example of an explanation fabricated by WT5. The sentence in the output of WT5 is fabricated, and an alternate sentence with the same beginning exists in the input passage.
Input Movie Review: note : some may consider portions of the following text to be spoilers . be forewarned . " all the world 's a stage and all the men and women merely players they have their exits and their entrances and one man in his time plays many parts " -excerpt from as you like it , act ii , scene 7 when william shakespeare penned this passage , he could not have possibly envisioned a world in which the domestic activites in an abode would be broadcast across the continent , or where women would install webcams in their apartments in order to convert voyeurism into cash . ... for the first time in his life , it is beginning to dawn on truman that things are not what they appear to be . ...
Output explanations by WT5: ... for the first time in his life, truman shows some of the oddest, most upbeat and affectionate attitudes towards women and men in general ....

Table 6: An example of explanation fabrication by the WT5 model (Narang et al., 2020) on an example from the MovieReviews dataset.