Controlling Text Edition by Changing Answers of Specific Questions

In this paper, we introduce the new task of controllable text edition, in which the input is a long text, a question, and a target answer, and the output is a minimally modified text that fits the target answer. This task is important in many situations, such as changing some conditions, consequences, or properties in a legal document, or changing some key information of an event in a news text. It is also very challenging: it is hard to obtain a parallel corpus for training, and the model must first find all text positions that should be changed and then decide how to change them. We constructed the new dataset WikiBioCTE for this task based on the existing dataset WikiBio (originally created for table-to-text generation). We use WikiBioCTE for training and manually labeled a test set for evaluation. We also propose novel evaluation metrics and a novel method for solving the new task. Experimental results on the test set show that our proposed method is a good fit for this novel NLP task.


Introduction
In many cases, we need to change specific content in a document. For example, in the legal domain, the terms and conditions in contract documents often need to be revised many times. We would like to use artificial intelligence to perform this process for human editors. A major difficulty is that the machine learning model must decide both where to edit and how to edit.
Usually, the place of the specific content ("where to edit") can be located by a question, and the content update ("how to edit") can be determined by the answer to that question. Therefore, in this paper, we propose the new task of controllable text edition (CTE). In this task, we would like to achieve the following goal: adjust some content of a document D so that the answer A of a document-related question Q changes to a new answer A′. As illustrated in Fig. 1, the question Q to D has an answer A (in red; its rationale in D also in red); if we would like to change the answer to the new answer A′ (in blue), then we have to change some content in D, yielding the modified text D′ (with the new content in blue) in the lower box. When we change the red part of the original text to the blue part, the answer to the question turns into the new answer as a consequence.
There are three main challenges in this task: (1) The model must decide which positions need to be changed in the document. Finding the answer positions for a given document-related question is similar to extractive machine reading comprehension (Zeng et al., 2020), which requires fully understanding both the question and the document. Nearly all extractive machine reading tasks, such as SQuAD (Rajpurkar et al., 2016, 2018) and CNN/Daily Mail (Hermann et al., 2015), focus on extracting one span from the document as the answer. Unlike extractive machine reading, in our task the answer A is not necessarily a substring of the document, and there may be multiple positions that have to be changed. Therefore, our task is considerably more challenging than extractive machine reading.
(2) The model should generate a new document that supports the new answer A ′ for question Q.
Note that this cannot be solved by directly replacing the original words at the edit positions with the new answer A′, because the new answer may not fit the surrounding text perfectly, which would make the document disfluent.
(3) There are nearly no parallel data for model training, because obtaining a large annotated set for this task is very hard.¹ However, the model can be trained on lists of ⟨Q, D, A⟩ triples, which can be obtained from datasets for machine reading and/or structured data extraction (as described below).
In this paper, we introduce and define the task of controllable text edition (CTE). We propose to transform the WIKIBIO dataset (Lebret et al., 2016) into a list of ⟨Q, D, A⟩ triples for training. WIKIBIO was originally designed for table-to-text generation, in which each example consists of a Wikipedia passage D and an infobox (a list of ⟨field, content⟩ pairs). In detail, we take each "field" in the infobox as the question Q and the corresponding "content" as the answer A. Therefore, for each ⟨field, content⟩ pair, we can create a ⟨Q, D, A⟩ triple. After some pruning, we finally selected 26 different Q's and 141k ⟨Q, D, A⟩ triples for the training set, as well as 17.7k triples for the development set. We also annotated a small test set of about 1k examples in the form ⟨Q, D, A, A′, D′⟩ (A′ is the new answer, and D′ is the ground-truth modified text). The resulting new dataset is called WIKIBIOCTE.
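The conversion from WIKIBIO infoboxes to ⟨Q, D, A⟩ triples can be sketched as follows. This is a minimal illustration of the described procedure, not the authors' code; the dictionary keys, field names, and helper names are our own assumptions.

```python
# Sketch: turning a WIKIBIO-style example into <Q, D, A> triples.
# Field names and the data layout here are illustrative assumptions.
from collections import Counter

FREQ_THRESHOLD = 5000  # the paper keeps fields occurring >5k times


def count_fields(examples):
    """Count how often each infobox field occurs across the corpus."""
    counts = Counter()
    for ex in examples:
        counts.update(ex["infobox"].keys())
    return counts


def make_triples(example, allowed_fields):
    """One <Q, D, A> triple per <field, content> pair with an allowed field."""
    doc = example["text"]
    return [
        {"Q": field, "D": doc, "A": content}
        for field, content in example["infobox"].items()
        if field in allowed_fields
    ]


example = {
    "text": "john doe ( born 1950 ) is an american pianist .",
    "infobox": {"occupation": "pianist", "nationality": "american"},
}
counts = count_fields([example])
# on the real corpus, allowed_fields = {f for f in counts if counts[f] > FREQ_THRESHOLD}
triples = make_triples(example, allowed_fields={"occupation", "nationality"})
```

On the real corpus, the frequency counts would be computed over the full training set before filtering.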
In addition, we propose a novel method, called Select-Mask-Generate (SMG), to solve the proposed CTE task. In this method, we use a selector-predictor architecture to select the answer-related tokens, and we then use complementary masks to split the text into an answer-related part and an answer-unrelated part. Then, we reconstruct the original text from the answer-unrelated part and the original answer. The reconstruction process is a partial generation method, which only generates the masked-out part, without any length limit. In our experiments, the SMG model achieves state-of-the-art performance compared to baseline models in the generation of modified documents. The code and the test set WIKIBIOCTE are available online.³

Related Work
The proposed task of controllable text edition is related to the following existing tasks.

Attribute Disentanglement
Attribute disentanglement aims to control the attributes of a given text or image (such as sentiment, tense, syntax, or face pose) by disentangling different attributes into different subspaces. When transferring attributes, the content of the text/image needs to be preserved. Disentanglement works can usually be divided into implicit and explicit disentanglement. Implicit disentanglement (Higgins et al., 2017; Chen et al., 2018; Moyer et al., 2018; Mathieu et al., 2018; Kim and Mnih, 2018) separates the latent space into several components in a purely unsupervised way, expecting that each component corresponds to an attribute. However, neither the number of components nor the correspondence between attributes and components can be decided in advance. Also, the training process may prune some of the components (Stühmer et al., 2019), which hurts the interpretability of the latent space. Explicit disentanglement (Chen et al., 2016; John et al., 2019; Romanov et al., 2019) aims to separate the latent space into more interpretable components with an explicit correspondence to specific attributes. Hence, it usually requires gold attribute labels in the training set.
In comparison, our task aims to control the content of the text by changing the answers to text-related questions. Attribute disentanglement is difficult to apply to our task, because the modification of the content must be decided by both the question and the answer simultaneously, which is a much sparser signal than attributes.

Lexically Constrained Decoding
Lexically constrained decoding (Hokamp and Liu, 2017; Miao et al., 2019; Sha, 2020) directly controls the output of a generation model by adding constraints. Usually, the constraints include hard constraints (requiring the generated sequence to contain some keywords) and soft constraints (requiring the generated sentence to have the same meaning as a given text).
³ https://sites.google.com/view/control-text-edition/home
The basic methods of lexically constrained decoding can be divided into enhanced beam search (Hokamp and Liu, 2017; Post and Vilar, 2018) and stochastic search (Miao et al., 2019; Liu et al., 2020; Sha, 2020). Enhanced beam search (Hokamp and Liu, 2017; Hasler et al., 2018) changes some strategies in beam search to make it easier to find a constraint-satisfying sentence. However, for tasks with an extremely large search space, beam-search-based methods may be computationally too costly or even fail (Miao et al., 2019). Stochastic search edits an initial sentence step by step, where the editing position and action can be decided by Metropolis-Hastings sampling (Miao et al., 2019), a discrete scoring function (Liu et al., 2020), or gradient-based methods (Sha, 2020). However, lexically constrained decoding is hard to apply to our task, because adjusting the text to fit a new answer to a text-related question is much more complicated than satisfying a hard or soft constraint.

Text Editing and Infilling
In some tasks, to simplify the text generation problem, researchers edit existing text or prototypes to obtain a refined text that satisfies some specific requirements. Examples are the generation of summaries by template-based rewriting (Cao et al., 2018) and the generation of text or a response by editing a prototype sentence (Pandey et al., 2018; Wu et al., 2019). In (Yin et al., 2018), distributed representations of edit actions are learned and applied to editing Wikipedia records (Faruqui et al., 2018) and GitHub code (Yin et al., 2018). Panthaplackel et al. (2020) further integrate a copy mechanism into text editing. Text infilling (Fedus et al., 2018) uses machine learning models to fill the blanks of a cloze test. Zhu et al. (2019) propose a more general text infilling task, which allows an arbitrary number of tokens (instead of a single token) in each blank.
In the above text editing tasks, the goal of editing is consistent across the whole dataset: a better summary, a better response, or a more informative sentence. In contrast, our proposed task requires the editing to be guided by the new answer to the document-related question, so each example has a different editing goal. Thus, our task requires first deciding where to edit according to the given question, and then deciding how to edit, which makes it more complicated than all the above text editing tasks.

Dataset
We now formally define the task of controllable text edition and propose a dataset for this task.

Task Definition
The task of controllable text edition (CTE) is defined as follows. The input is a triple ⟨D, Q, A′⟩, where D is a document, Q is a document-related question, and A′ is the expected answer for Q on D. The output is D′, a minimal modification of D such that the answer for Q on D′ is now A′. Note that the original answer of Q on D is A, but A is not an input to the task, and usually A ≠ A′.

WIKIBIO as Controllable Text Editing Dataset
We propose to modify the WIKIBIO dataset (Lebret et al., 2016) to make it fit our task. WIKIBIO was originally designed for table-to-text generation, which generates a celebrity's biography according to his/her basic information. Each example in the dataset consists of a Wikipedia infobox and a text (the first paragraph of the Wikipedia page) describing the infobox, as shown in Table 1.
Viewed inversely, the WIKIBIO dataset can be taken as a question-answering dataset: each field can be taken as a question, and each content as an answer. For example, in Fig. 1, the field "Occupation" can be interpreted as the question "What is the person's occupation?", and the corresponding content "Virology" is the answer.
Therefore, we take the text in WIKIBIO as the document (D) in our task, the field as the question (Q), and the content as the answer (A). Due to the huge cost of data annotation, the model needs to be trained without the changed answer (A′) and the reference document (D′).
For the creation of the training and development sets, we count the frequency of fields and select those that occur more than 5k times in WIKIBIO's training set as candidate questions (Q's). Then, we filter out Q's that do not have corresponding answers in D. We thus obtain a list of 26 different Q's, as shown in Table 2. After filtering the Q's according to Table 2, we get 141k ⟨Q, D, A⟩ triples for the training set and 17.7k triples for the development set.
Then, we manually labeled a small test set in which each example contains ⟨D, Q, A⟩ as well as the changed answer (A′) and the reference document (D′). The annotation process is as follows: 1. We randomly sampled an equal number of examples for each of the fields in Table 2. For each field, we sample ⌈1000/#F⌉ cases (#F is the number of selected fields), so that the size of the test set is around 1k.
2. We assigned a changed answer (A′) to each example by randomly picking a phrase similar to the original answer (A). The similar phrase may come from a different example, but it shares the same Q as the original answer (A).
3. We asked human annotators to produce a modified text (D′) for each example according to the original text (D), the question (Q), and the changed answer (A′). Two trained linguists annotated the 1k test set.
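Step 2 of the annotation protocol can be sketched as follows. This reflects our reading of "a similar phrase that shares the same Q": pool all answers observed with a given question and sample a different one. The function and variable names are ours, not the paper's.

```python
# Sketch of assigning a changed answer A' by sampling a different answer
# that occurs with the same question Q. Names are illustrative assumptions.
import random
from collections import defaultdict


def build_answer_pools(triples):
    """Group all observed answers by their question Q."""
    pools = defaultdict(set)
    for t in triples:
        pools[t["Q"]].add(t["A"])
    return pools


def sample_changed_answer(triple, pools, rng):
    """Pick an A' != A drawn from answers sharing the same Q."""
    candidates = sorted(pools[triple["Q"]] - {triple["A"]})
    return rng.choice(candidates) if candidates else None


triples = [
    {"Q": "occupation", "D": "...", "A": "pianist"},
    {"Q": "occupation", "D": "...", "A": "guitarist"},
    {"Q": "occupation", "D": "...", "A": "violinist"},
]
rng = random.Random(0)
pools = build_answer_pools(triples)
a_prime = sample_changed_answer(triples[0], pools, rng)
```

In the actual annotation, a human then wrote the reference D′ by hand; only the A′ assignment is automatic here.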
Note that other datasets could potentially be adapted into a controllable text editing dataset, such as SQuAD (Rajpurkar et al., 2016), RACE (Lai et al., 2017), and MCTest (Richardson et al., 2013). We did not choose them for the following reasons: (1) For extractive machine reading tasks like SQuAD (Rajpurkar et al., 2016), the answers are simple substrings of the document, so in most cases the text modification in our task could be solved by a simple string replacement, which defeats the goal of our task.
(2) Multiple-choice machine reading tasks like RACE (Lai et al., 2017) usually require full and deep reasoning over the whole document to get the answer, so the text modification in our task could not be solved by partial modification. In contrast, most contents (A) in WIKIBIO cannot be directly extracted as substrings from the document (D). Besides, the contents usually have related information that should be modified at the same time. For example, if somebody is a pianist, then he/she may have received a piano award rather than a guitar award. Therefore, WIKIBIO satisfies the goal of our proposed task: making minimal changes to the original document to make it fit the changed answer (A′).

Select-Mask-Generate (SMG) Method for Controllable Text Edition
We introduce the training and testing procedures of our proposed method. In the training phase, the model learns to recognize answer-related (A-related) tokens and to fill new-answer-related (A′-related) tokens into the blanks left after deleting the answer-related tokens.

Training Phase
In the training phase, we only have Q, D, and A. So, we teach the model to (1) identify answer-related information, and (2) reconstruct D from A and (D − A_p) (the original text with all answer-related information masked out, where A_p denotes the predicted answer-related tokens). The model architecture is shown in Fig. 2. Inspired by InfoCal, we use a Selector-Predictor architecture to identify the least-but-sufficient answer-related words in the original document (D). The main architecture of the Selector network is a BiLSTM model, which samples⁵ a binary-valued mask (M) for each input token (called the answer mask), denoting whether this token is selected as answer-related (1) or not (0). Given an input document D = {x_1, …, x_n} and a question Q, the Selector samples an answer-related mask M = {m_1, …, m_n} as follows:

M ∼ Sel(M | D, Q),

where "Sel" represents the selector network. Then, we call the complement of the answer mask (M̄ = 1 − M) the context mask, and we call the token sequence obtained after masking out the answer-related tokens the context template.
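The differentiable mask sampling (footnote 5) can be sketched with Gumbel-Softmax (Jang et al., 2016). This is a minimal NumPy illustration, not the authors' implementation: the per-token logits would come from the BiLSTM over (D, Q), and a deep-learning framework would use the straight-through estimator in the backward pass.

```python
# Minimal sketch of the Selector's binary mask sampling with Gumbel-Softmax.
# In the real model, `logits` come from a BiLSTM over (D, Q); here they are
# given directly, and numpy stands in for a differentiable framework.
import numpy as np


def gumbel_softmax_mask(logits, tau=1.0, rng=None):
    """logits: (n, 2) scores for (not-selected, selected) per token.
    Returns a hard 0/1 mask m_i per token; in a framework, the soft sample
    would be kept in the backward pass (straight-through trick)."""
    rng = rng or np.random.default_rng(0)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    y = logits + g
    soft = np.exp(y / tau) / np.exp(y / tau).sum(axis=-1, keepdims=True)
    return soft.argmax(axis=-1)  # hard mask, one value per token


logits = np.array([[2.0, -2.0], [-2.0, 2.0], [0.0, 0.0], [3.0, -3.0]])
m = gumbel_softmax_mask(logits)
context_mask = 1 - m  # the complementary context mask from the paper
```

The temperature tau controls how close the soft sample is to a hard one-hot choice before the argmax.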

Answer Reconstruction
We require that the answer-related information contains everything about the answer A, so we use an answer decoder to reconstruct an answer sequence Ã. We then calculate the reconstruction loss L_A as follows:

L_A = −∑_t log p_a(a_t | a_<t, v̄_sel),

where Dec_A is the answer decoder, p_a is the sequence distribution generated by Dec_A, and v̄_sel is the average vector of the selected token vectors, which is the input to Dec_A. The answer-related tokens are usually very few, so it is not necessary to use heavier encoders such as LSTMs (Hochreiter and Schmidhuber, 1997) or transformers (Vaswani et al., 2017).

Document Reconstruction
On the other hand, D should be reconstructed from the context template and the gold answer A. We use an LSTM encoder Enc_D to encode the context tokens:

h′_1, …, h′_n = Enc_D({x_1 · m̄_1, …, x_n · m̄_n}),

where h′_1, …, h′_n are the encoding vectors corresponding to each input token. We then take the averaged word vector of the input gold answer A, denoted V_A, as an external condition of the decoder.

⁵ The sampling process is implemented with Gumbel-Softmax (Jang et al., 2016), which is differentiable.

Figure 2: The architecture of our SMG model. In the testing phase, we replace the input to the answer encoder from the gold answer A with the new answer A′; the output of the context decoder then becomes the modified text D̂′.
Differently from conventional decoders, our decoder generates tokens only to fill in the blanks of the context template, as shown in Fig. 3. This brings two changes to the training phase: (1) we only need to compute the loss for the tokens filled into the blanks, and (2) the model needs to learn an extra end-of-answer (EOA) token S_eoa for each token filled into the blanks. The EOA token is important because it indicates when to stop filling the current blank.
Learning to generate the words. At each time step t of the decoder, we use an LSTM (Hochreiter and Schmidhuber, 1997) unit to predict the next word y_t and the EOA token S_eoa as follows:

h_t = F_LSTM(h_{t−1}, [y_{t−1}; V_A]),
h_w, h_eoa = F_m(h_t),
s^lstm_t(w) = F_w(h_w),
p(S_eoa(t)) = softmax(F_eoa(h_eoa)),

where h_w and h_eoa are hidden layers (the time step index t is omitted), F_LSTM is an LSTM cell, F_m, F_w, and F_eoa are linear layers, and s^lstm_t(w) is a scoring function that suggests the next word to generate. p(S_eoa(t)) is the probability distribution of the EOA token.
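A single decoder step of this kind can be sketched as below. This is a toy NumPy illustration of the two prediction heads (word scores and EOA probability), not the trained model: the hidden state is given directly instead of coming from an LSTM cell, and the weight matrices are random stand-ins for the linear layers.

```python
# Toy sketch of one decoder step: score the next word and the end-of-answer
# (EOA) indicator from the current hidden state. Random matrices stand in
# for the linear layers F_w and F_eoa; the LSTM cell is omitted.
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def decoder_step(h, W_w, W_eoa):
    """h: (d,) hidden state. Returns a word distribution and P(EOA = 1)."""
    word_scores = W_w @ h           # s_t^lstm(w) over the vocabulary
    p_word = softmax(word_scores)
    eoa_logits = W_eoa @ h          # 2-way: continue filling vs. end blank
    p_eoa = softmax(eoa_logits)[1]
    return p_word, p_eoa


rng = np.random.default_rng(0)
d, vocab = 8, 20
p_word, p_eoa = decoder_step(rng.normal(size=d),
                             rng.normal(size=(vocab, d)),
                             rng.normal(size=(2, d)))
```

At inference time, the word head proposes the next blank-filling token while the EOA head decides whether the current blank is complete.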
Note that in the decoder, we use the copy mechanism, which encourages the decoder to generate words by directly copying from the input context sequence D and the answer sequence A. The copy mechanism computes a copy score s^copy_t(w) for each word in D and A. The generation probability of each word is then computed as:

p_t(w) ∝ exp(s^lstm_t(w) + s^copy_t(w)).

Thus, the reconstruction loss of document D is:

L_D = −E_{M∼Sel(M|D,Q)} [ ∑_t m_t log p_t(y_t) ],

where the mask m_t is multiplied in at each time step because we only need the losses of the blank-filling tokens.
Learning the end-of-answer (EOA) tags. We have an EOA tag for each blank-filling token. The EOA tag is 1 if the corresponding token is the last token in the blank; for all other blank-filling tokens, the EOA tag is 0. The gold EOA tag at each time step, g^eoa_t, can be computed from the difference between the previous answer mask m_{t−1} and the current answer mask m_t. There are three possible values (−1, 0, and 1): g^eoa_t = 0 when the difference is −1 or 0, and g^eoa_t = 1 when the difference is 1. Then, we have the cross-entropy loss in Eq. 13:

L_eoa = −∑_t [ g^eoa_t log p(S_eoa(t) = 1) + (1 − g^eoa_t) log p(S_eoa(t) = 0) ].    (13)

Therefore, the final optimization objective is shown in Eq. 14:

L = L_D + λ_r L_A + λ_eoa L_eoa,    (14)

where λ_r and λ_eoa are hyperparameters.
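The gold EOA tags can be read off the answer mask directly. The sketch below implements the definition "tag 1 iff the token is the last token of its blank" (i.e., it is answer-related and the next position is not), which matches the paper's mask-difference rule shifted onto the token itself; the function name is ours.

```python
# Sketch: gold EOA tags from the answer mask M. A token gets tag 1 iff it is
# answer-related (m_t = 1) and the blank ends right after it (m_{t+1} = 0).
def gold_eoa_tags(mask):
    n = len(mask)
    return [
        1 if mask[t] == 1 and (t == n - 1 or mask[t + 1] == 0) else 0
        for t in range(n)
    ]


# Answer mask with two blanks, of lengths 2 and 1:
tags = gold_eoa_tags([0, 1, 1, 0, 1, 0])
```

Each blank thus contributes exactly one positive EOA label, on its final token.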

Inference Phase
In the inference phase, we feed the new answer A′ to the context decoder instead of the gold answer A. The output of the context decoder then becomes the modified text D̂′. We choose an autoregressive partial generation method for inference. Our partial generation method can fill the blanks with phrases of any length and can directly replace any decoder, which no existing alternative can do. For example, the infilling method with global context (Donahue et al., 2020) is a pretrained language model by itself. In our architecture, however, the masks are decided by the selector module, so even the number and lengths of the blanks cannot be determined before training; the ground-truth target sequence for fine-tuning a pretrained language model would thus also be hard to determine. Therefore, partial generation is the best choice for our task.

Partial Generation
Since we already have a context template when generating the modified document, we only need to generate tokens to fill the blanks in the context template. The partial decoding process is shown in Fig. 3. We use an indicator state = 0 to denote the reading mode (reading the context template words) and state = 1 to denote the writing mode (generating the blank-filling words). The basic generation process is as follows: while the model is reading the context template, if it meets a masked token, it switches to writing mode and starts to generate words to fill the current blank. When the EOA tag turns to 1, or the decoding length l_g surpasses a limit l_max, it switches back to reading mode. Note that this decoding process can generate an arbitrary number of words for each blank, and we can fill all blanks in a context template in a single decoding pass, which is much more efficient than MaskGAN (Fedus et al., 2018) and the text filler of Zhu et al. (2019). The detailed algorithm is shown in Algorithm 1.
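The reading/writing loop above can be sketched as follows. This is our reading of Algorithm 1, not the authors' code: `generate_step` is a hypothetical stand-in for the trained decoder, returning the next word and an EOA flag.

```python
# Sketch of the partial decoding loop: read context tokens; on entering a
# masked run, switch to writing mode and call the generator until it signals
# EOA or the blank length reaches l_max. `generate_step` is a stand-in for
# the trained decoder.
def partial_generate(tokens, mask, generate_step, l_max=10):
    out, prev = [], 0
    for tok, m in zip(tokens, mask):
        if m == 0:                    # reading mode: keep the context token
            out.append(tok)
        elif prev == 0:               # entering a blank: writing mode
            l_g = 0
            while l_g < l_max:
                word, eoa = generate_step(out)
                out.append(word)
                l_g += 1
                if eoa:               # EOA tag fired: back to reading mode
                    break
        prev = m                      # consecutive masked tokens = one blank
    return out


# Dummy generator that writes "basketball player" into the blank.
def dummy_step(prefix, _state={"i": 0}):
    words = [("basketball", False), ("player", True)]
    word, eoa = words[_state["i"] % 2]
    _state["i"] += 1
    return word, eoa


tokens = ["he", "is", "a", "<blank>", "<blank>", "."]
mask = [0, 0, 0, 1, 1, 0]
result = partial_generate(tokens, mask, dummy_step)
```

Note that a run of consecutive masked positions is treated as a single blank, and the generator alone decides how many words to emit for it.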

Experiments
In this section, we propose specific evaluation metrics for our controllable text edition task and then compare and analyze the performance of our proposed method (SMG) on the WIKIBIOCTE dataset.

Evaluation Metrics
For the evaluation of the modified document D̂′, we use the following automatic evaluation metrics. (1) BLEU (D̂′ vs. D′): This metric measures the BLEU score (Papineni et al., 2002) between the generated modified document D̂′ and the reference document D′.
(2) iBLEU (Sun and Zhou, 2012): This metric has been widely used for evaluating paraphrase generation (Liu et al., 2020; Sha, 2020). iBLEU is defined as iBLEU = BLEU(D̂′, D′) − α·BLEU(D̂′, D), which penalizes the similarity between the modified document D̂′ and the original document D. The goal of this metric is to measure the extent to which the model directly copies words from the original document D without taking any content from A′.
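The iBLEU formula can be made concrete on top of any sentence-level BLEU function. In the sketch below, `bleu` is a crude unigram-precision stand-in used only to make the arithmetic runnable; a real evaluation would use a standard BLEU implementation, and the value of α is a placeholder, not the one used in the paper.

```python
# iBLEU = BLEU(D_hat', D') - alpha * BLEU(D_hat', D), on top of a toy
# unigram-precision `bleu`. A real evaluation would use a proper BLEU
# package; alpha here is an illustrative value.
def bleu(hyp, ref):
    hyp, ref = hyp.split(), ref.split()
    if not hyp:
        return 0.0
    matches = sum(1 for w in hyp if w in ref)
    return matches / len(hyp)


def ibleu(generated, reference, source, alpha=0.9):
    """Reward similarity to the reference D', penalize copying the source D."""
    return bleu(generated, reference) - alpha * bleu(generated, source)


score = ibleu("he is a basketball player",   # generated D_hat'
              "he is a basketball player",   # reference D'
              "he is a pianist")             # original D
```

A generation that merely copies D scores high on the first term only if D′ is also close to D, so the penalty term isolates genuine edits.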
(3) diff-BLEU ratio: diff-BLEU is a BLEU score computed between D̂′ and a difference sequence between the gold modified document D′ and the original document D. The difference sequence is obtained by masking out the longest common sequence of D and D′ from D′. Since the maximum value of this BLEU score is the BLEU value between the gold modified document D′ and the difference sequence, we use their quotient as the diff-BLEU ratio, as shown in Eq. 15:

diff-BLEU ratio = BLEU(D̂′, diff(D, D′)) / BLEU(D′, diff(D, D′)).    (15)

(4) Perplexity: This metric measures the fluency of the generated content-modified document D̂′. We applied a third-party language model (a Kneser-Ney language model; Kneser and Ney, 1995) as the perplexity evaluator. We trained the language model on the whole training set of WIKIBIO and use it to evaluate fluency, where a lower perplexity value is better.
Besides the automatic metrics, we used human evaluation for two aspects of the content-modified document D̂′. Correctness is an accuracy score from 0.0% to 100.0%, which evaluates whether D̂′ has successfully turned the answer of question Q from A to A′. Fluency ranges from 0.0 to 5.0 and evaluates whether D̂′ is fluent from a human's point of view. The scoring details are in the supplemental materials.
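The difference sequence underlying diff-BLEU can be computed with a standard longest-common-subsequence dynamic program, as sketched below; this is our reading of "masking out the longest common sequence", and the helper names are ours.

```python
# Sketch of the diff-BLEU difference sequence: keep the tokens of D' that
# are not part of a longest common subsequence (LCS) of D and D'.
def lcs_table(a, b):
    """Standard LCS dynamic-programming table."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp


def difference_sequence(d, d_prime):
    """Tokens of d_prime outside an LCS with d, in original order."""
    dp = lcs_table(d, d_prime)
    keep = set()
    i, j = len(d), len(d_prime)
    while i > 0 and j > 0:            # backtrack through the table
        if d[i - 1] == d_prime[j - 1]:
            keep.add(j - 1)
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return [w for k, w in enumerate(d_prime) if k not in keep]


d = "he is a pianist".split()
d_prime = "he is a basketball player".split()
diff = difference_sequence(d, d_prime)
```

The diff-BLEU ratio then divides BLEU(D̂′, diff) by BLEU(D′, diff), so the score is 1 when the generated edits match the gold edits exactly.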
Also, in our method, the selection of answer-related words is very important, so we use two evaluations for the selection part: (1) BLEU (predicted template) is the BLEU score between the predicted template (the token sequence obtained after masking out the answer-related words from the text D) and the gold template (the common sequence of D and D′).
(2) Answer F1 measures the bag-of-words (BOW) F1 of the generated answer Ã against the gold answer A. This metric is difficult to score highly on, because it requires both selecting the correct answer-related tokens and generating the correct words of the answer A.

Table 4: Performance of answer-related word selection.
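The bag-of-words F1 can be sketched as follows; counting token overlap with multiplicity is our reading of the metric, and the function name is ours.

```python
# Bag-of-words F1 between a generated answer and the gold answer, counting
# token overlap with multiplicity via multiset intersection.
from collections import Counter


def bow_f1(generated, gold):
    gen, ref = Counter(generated.split()), Counter(gold.split())
    overlap = sum((gen & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


f1 = bow_f1("basketball player", "retired basketball player")
```

Here precision is 1.0 and recall is 2/3, giving F1 = 0.8.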

Overall Performance
We compare our method (SMG) with a baseline method (Seq2Seq). The only difference between Seq2Seq and SMG is the decoder: Seq2Seq uses a conventional decoder that generates the entire modified document D̂′, ignoring the context template.
The overall performance is shown in Table 3.
In Table 3, we see that our SMG method outperforms the Seq2Seq baseline on nearly all evaluation metrics, regardless of whether the context template applied in the decoding phase is gold or predicted. In particular, on the two most important metrics for controllable text edition, iBLEU and the diff-BLEU ratio, our model achieves significantly higher scores than the competing methods. These results show that our method is effective for controllable text edition.
The human evaluation results are also listed in Table 3. The inter-rater agreements are all acceptable (> 0.85) according to Krippendorff (2004). According to the human evaluation, when using the gold template for partial generation, both the correctness and the fluency of the partially generated text D̂′ are better than when using the predicted template, which is consistent with our intuition. Note that the perplexity and fluency scores of Seq2Seq are the best of the three methods; this is because in the partially generated text, the end position of a blank may sometimes not fit well with the next word, even though we have trained an EOA tag. Table 4 shows the experiments evaluating the selection of answer-related words. Our SMG model has a higher BLEU (predicted template) score than the Seq2Seq model, which shows that partially training the blank-filling tokens helps the selection of answer-related tokens. Our model also achieves a higher answer F1 score (0.68) than the competing methods.

Case Study
Table 5 lists some examples of the modified document D̂′ generated by the three competing methods (Seq2Seq, SMG(g), and SMG(p)). Although the answer-related words are already masked out, Seq2Seq still frequently generates words from the original answer A and tends to mix up the words of A and the changed answer A′ (in the second example, Seq2Seq mixes "gymnastics" and "basketball" together). Seq2Seq also cannot precisely change everything that should be modified; in the second example, it fails to change "gymnastics coach" to "basketball coach". Among the SMG variants, when using the gold template for partial generation, the model generates the correct words needed to change Q's answer to A′. Although the model with the predicted template still risks leaving some answer-related tokens unchanged due to errors in the predicted template, the context tokens in the predicted template are guaranteed to be preserved. Therefore, our model with the predicted template is better suited for practical NLP applications than Seq2Seq.

Conclusion
In this paper, we proposed a novel task, the goal of which is to modify some content of a given text so that the answer to a text-related question changes to a given new answer. This task is very useful in many real-world scenarios, such as contract editing. We constructed a test set for evaluation and released it. We also proposed a novel model, SMG, to solve this task. In SMG, we first use a selector-predictor structure to select the answer-related tokens in the input document; then we use a novel partial generation technique to generate the modified document without changing the answer-unrelated tokens in the original document. The experiments proved the effectiveness of our model.

Table 5: Example outputs of the competing methods. The underlined tokens are gold answer-related tokens. The bold tokens in the "Input" rows are predicted answer-related tokens. In the other rows, the bold tokens are the modified tokens related to the given new answer A′.

Example 1 (Q: position, A′: halfback quarterback)
Input D: george evans -lrb- born 13 december 1994 -rrb- is an english footballer who plays as a midfielder or centre-back for manchester city .
Seq2Seq D̂′: george evans -lrb- born 13 december 1994 -rrb- is an english footballer who plays as a midfielder or centre-back for manchester city . he was a quarterback halfback in the manchester .
SMG(g) D̂′: george evans -lrb- born 13 december 1994 -rrb- is an english footballer who plays as a halfback and quarterback for manchester city .
SMG(p) D̂′: george evans -lrb- born 13 december 1994 -rrb- is an halfback footballer who plays as a midfielder or quarterback for manchester city .

Example 2 (Q: discipline, A′: basketball player)
Input D: andrei UNK -lrb- born 1975 in satu mare , romania -rrb- is a retired romanian aerobic gymnast . he had a successful career winning four world championships medals -lrb- two gold , one silver , and one bronze -rrb- after his retirement in 1997 he went with to germany where he works as a gymnastics coach at the UNK gymnastics club in hanover .
Seq2Seq D̂′: andrei UNK -lrb- born 1975 in satu mare romania is a retired romanian aerobic gymnast basketball he had a successful career winning four world championships medals -lrb- two gold , one silver , and one bronze , after his retirement in 1997 he went with to germany where he works as a gymnastics coach at the UNK basketball club
SMG(g) D̂′: andrei UNK -lrb- born 1975 in satu mare , romania -rrb- is a retired romanian basketball player . he had a successful career winning four world championships medals -lrb- two gold , one silver , and one bronze -rrb- after his retirement in 1997 he went with to germany where he works as a basketball coach at the UNK basketball club in hanover .
SMG(p) D̂′: andrei UNK -lrb- born 1975 in satu mare , romania -rrb- is a retired romanian basketball player . he had a successful career winning four world championships medals -lrb- two gold , one silver , and one bronze -rrb- after his retirement in 1997 he went with to germany where he works as a gymnastics coach at the UNK gymnastic club in hanover .

Example 3 (Q: nationality, A′: philippines filipino)
Input D: andrew justin stewart coats -lrb- born 1 february 1958 -rrb- is an australian -british academic cardiologist who has particular interest in the management of heart failure . his research turned established teaching on its head and promoted exercise training -lrb- rather than bed rest -rrb- as a treatment for chronic heart failure . he was instrumental in describing the " muscle hypothesis " of heart failure .
Seq2Seq D̂′: andrew justin stewart coats -lrb- born 1 february 1958 is an filipino -british academic cardiologist who has particular interest in the management of heart failure . his research turned established teaching on its head and promoted exercise training -lrb- rather than bed rest -rrb- as a treatment for chronic heart failure . he was instrumental in describing the " muscle hypothesis " of philippines
SMG(g) D̂′: andrew justin stewart coats -lrb- born 1 february 1958 -rrb- is an filipino academic cardiologist who has particular interest in the management of heart failure . his research turned established teaching on its head and promoted exercise training -lrb- rather than bed rest -rrb- as a treatment for chronic heart failure . he was instrumental in describing the " muscle hypothesis " of heart failure .
SMG(p) D̂′: andrew justin stewart coats -lrb- born 1 february philippines academic cardiologist who has particular interest in the management of heart failure . his research turned established teaching on its head and promoted exercise training -lrb- rather than bed rest -rrb- as a treatment for chronic heart failure . he was instrumental in describing the " muscle hypothesis " of heart failure .