Differentially Private Language Models for Secure Data Sharing

To protect the privacy of individuals whose data is being shared, it is of high importance to develop methods allowing researchers and companies to release textual data while providing formal privacy guarantees to its originators. In the field of NLP, substantial efforts have been directed at building mechanisms following the framework of local differential privacy, thereby anonymizing individual text samples before releasing them. In practice, these approaches are often dissatisfying in terms of the quality of their output language due to the strong noise required for local differential privacy. In this paper, we approach the problem at hand using global differential privacy, particularly by training a generative language model in a differentially private manner and consequently sampling data from it. Using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets taking on specific desired attributes such as sentiment or topic and resembling statistical properties of the training data. We perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality and highly suitable for training models for further analysis on real-world data. Notably, we also demonstrate that training classifiers on private synthetic data outperforms directly training classifiers with DP-SGD.


Introduction
Rapid advancements in the field of deep learning and natural language processing (NLP) have enabled companies, public institutions and researchers to extract information and gain knowledge from large-scale data generated by individuals. In many cases, it is desirable to share such data with third parties, for example when analyses are performed by external consultants or in order to provide high quality benchmarks for the research community. This, however, entails a variety of risks related to privacy that cannot merely be solved by pseudonymization: A variety of deanonymization attacks enable the re-identification of individuals from tabular data such as movie ratings (Narayanan and Shmatikov, 2008), geolocation data (Lee et al., 2017) and notably also text (Koppel et al., 2009; Shrestha et al., 2017; Fabien et al., 2020). It is therefore highly desirable to develop anonymization mechanisms enabling secure data sharing, ideally with mathematical privacy guarantees as granted by differential privacy (DP) (Dwork and Roth, 2014).

Figure 1: Main idea of our paper: To share potentially sensitive datasets with third parties, we train a language model (LM) on the sensitive data in a differentially private manner and consequently prompt the LM to generate synthetic samples with privacy guarantees.

Existing approaches anonymize every text sample individually by obtaining differentially private vector representations (Weggenmann and Kerschbaum, 2018; Fernandes et al., 2019) or using sequence-to-sequence approaches that rewrite a given sample to eliminate user-revealing information (Shetty et al., 2018; Feyisetan et al., 2019a, 2020a; Weggenmann et al., 2022), thereby following local differential privacy. As pointed out by Mattern et al. (2022), local DP requires a very high degree of noise, which often leads to incoherent language and only little semantic overlap. The strict requirements of local DP are, however, not necessary if we assume that an entity aiming to share data already has access to the full collection of user-written texts and only wants to release an anonymized version of it.
In this paper, inspired by recent advances demonstrating the feasibility of training large language models (LLMs) in a differentially private manner (Li et al., 2021), we propose a globally differentially private data release mechanism relying on the generation of a "twin" dataset of the original, sensitive user data from large language models. As depicted in Figure 1, we train GPT-2 (Radford et al., 2019) to generate texts of our original dataset based on prompts inferred from the sample's individual attributes such as sentiment or topic. For fine-tuning, we use a differentially private optimization algorithm in order to protect the content of our training data. Subsequently, we sample from the trained model to generate a large number of synthetic, anonymous texts, resulting in a verifiably private "twin" dataset. We carefully evaluate our proposed method using popular NLP datasets such as IMDb movie reviews or Amazon product reviews. Here, we find that even after learning with strong privacy guarantees such as ϵ = 3 or ϵ = 8 from only a very limited number of training samples such as 25 or 50, our generated data is of high quality and the classifiers trained on it achieve accuracies only ∼3% lower than those trained on the full original dataset containing thousands of samples. Notably, we also find that transformer-based classification models trained on private synthetic data outperform models trained on real data with differentially private optimization. Finally, we show that the differentially private fine-tuning procedure effectively minimizes the risk of data leakage from language models that was previously discovered by Carlini et al. (2021).

Differential Privacy
Differential privacy (DP) is a formal notion of privacy that is currently considered the state of the art for quantifying and limiting information disclosure about individuals. It was introduced by Dwork et al. (2006a) under the name ϵ-indistinguishability with the goal of giving semantic privacy by quantifying the risk to an individual that results from participation in data collection.
In the original, central model of DP, we consider adjacent datasets that differ by at most one record (i.e., one individual's data). A differentially private query on both databases should yield matching results with similar probabilities, i.e., answers that are probabilistically indistinguishable. This is achieved via random mechanisms that return noisy query results, thus masking the impact of each individual.
Definition 1. Let ϵ > 0 be a privacy parameter, and 0 ≤ δ ≤ 1. A randomized mechanism M on X fulfills (ϵ, δ)-DP if for any pair of adjacent inputs x, x′ ∈ X, and all sets of possible outputs S,
$$\Pr[M(x) \in S] \le e^{\epsilon} \, \Pr[M(x') \in S] + \delta. \tag{1}$$
In the local model (Duchi et al., 2013), noise is added locally at the data source, before the data is collected and stored in a central database. A basic example is randomized response (Warner, 1965), where each survey participant provides either a truthful or a random answer depending on the flip of an (unbiased) coin. The local model makes the strong assumption that any two inputs are considered adjacent, which often makes it difficult to achieve a satisfying privacy-utility trade-off.
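The randomized response mechanism mentioned above can be sketched in a few lines. This is a minimal illustration, assuming that the "random answer" is decided by a second unbiased coin; the function names are hypothetical and not taken from any library. Under this assumption a truthful "yes" is reported with probability 0.75 and a false "yes" with probability 0.25, so the mechanism satisfies ϵ-DP with ϵ = ln 3.

```python
import math
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report the true answer on heads, a uniformly random answer on tails."""
    if rng.random() < 0.5:       # first unbiased coin: heads -> answer truthfully
        return truth
    return rng.random() < 0.5    # tails -> a second coin decides the reported answer


def estimate_true_rate(reports: list[bool]) -> float:
    """Debias the observed rate, using E[observed] = 0.5 * true_rate + 0.25."""
    observed = sum(reports) / len(reports)
    return 2.0 * observed - 0.5


# P(report yes | truth yes) = 0.75 and P(report yes | truth no) = 0.25,
# so the likelihood ratio is bounded by 3 and epsilon = ln 3.
EPSILON = math.log(0.75 / 0.25)

if __name__ == "__main__":
    rng = random.Random(0)
    truths = [i < 3000 for i in range(10000)]  # 30% of participants hold "yes"
    reports = [randomized_response(t, rng) for t in truths]
    print(f"epsilon = {EPSILON:.3f}, estimated rate = {estimate_true_rate(reports):.3f}")
```

Each individual report is noisy, yet the aggregate rate can still be estimated, which is the trade-off the local model relies on.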

Differentially Private Optimization
An important application of DP is privacy-preserving machine learning to protect the privacy of the training data. Typically, neural networks are trained by optimizing a loss function using stochastic gradient descent (SGD) or a derived method such as Adam (Kingma and Ba, 2015), which iteratively compute gradients of the loss function over batches of samples from the training dataset. As shown by Song et al. (2013a); Bassily et al. (2014a); Abadi et al. (2016a), it is possible to implement a differentially private version of SGD (DP-SGD) by clipping the gradients and applying the Gaussian mechanism (Dwork and Roth, 2014): The latter works by applying noise from an isotropic Gaussian distribution $\mathcal{N}(0, \sigma^2 I)$, where the standard deviation σ is derived based on the desired privacy parameters ϵ and δ.
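The clip-then-noise update described above can be illustrated with a small pure-Python sketch. It is not the implementation used in the paper: `max_norm`, `noise_multiplier` and the function names are hypothetical, and the noise standard deviation is assumed to be `noise_multiplier * max_norm` rather than derived from target values of ϵ and δ.

```python
import math
import random


def clip_to_norm(grad: list[float], max_norm: float) -> list[float]:
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_sample_grads, lr, max_norm, noise_multiplier, rng):
    """One update: clip every per-sample gradient, sum, add Gaussian noise, average."""
    batch_size = len(per_sample_grads)
    clipped = [clip_to_norm(g, max_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # assumed std of the isotropic Gaussian noise
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and gets clipped
    print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping bounds the influence of any single sample on the summed gradient, which is what allows the added noise to mask each individual contribution.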
To achieve good privacy-utility trade-offs, it is important to accurately track the total privacy budget spent throughout the entire training. In the context of DP, the repeated execution of the same (here: Gaussian) mechanism is referred to as composition. Basic (Dwork et al., 2006b) and various more refined, advanced composition theorems (Dwork et al., 2010; Dwork and Rothblum, 2016; Bun and Steinke, 2016) have been stated in the literature that aim at providing tight bounds for the overall privacy budget. However, these advances still resulted in relatively loose bounds and thus large overall privacy budgets over the course of highly iterative algorithms such as DP-SGD. Tight worst-case bounds for composition were derived by Kairouz et al. (2015); however, computing them was shown to be infeasible in general (Murtagh and Vadhan, 2016).
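To make the gap between composition bounds concrete, the following sketch compares basic composition with the standard advanced composition bound for k runs of an (ϵ, δ)-DP mechanism. The function names and the slack parameter `delta_slack` are chosen here for illustration only.

```python
import math


def basic_composition(eps: float, delta: float, k: int) -> tuple[float, float]:
    """k-fold basic composition: the budgets simply add up."""
    return k * eps, k * delta


def advanced_composition(eps: float, delta: float, k: int, delta_slack: float) -> tuple[float, float]:
    """Advanced composition: eps grows roughly with sqrt(k) at the cost of an extra delta_slack."""
    eps_total = eps * math.sqrt(2.0 * k * math.log(1.0 / delta_slack)) + k * eps * (math.exp(eps) - 1.0)
    return eps_total, k * delta + delta_slack


if __name__ == "__main__":
    k, eps, delta = 10_000, 0.01, 1e-9
    print("basic:   ", basic_composition(eps, delta, k))
    print("advanced:", advanced_composition(eps, delta, k, delta_slack=1e-5))
```

For many small steps the advanced bound is far below the basic one, but as noted above it is still loose compared with the accountants used for DP-SGD.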
For this reason, specific efforts have been made to find tighter bounds and accurate approximations for the overall privacy loss: A first example that provides substantially reduced upper bounds is the moments accountant (Abadi et al., 2016a), which is closely related to Rényi DP (Mironov, 2017), a generalization of DP based on Rényi divergence. Gaussian and f-DP (Dong et al., 2019) provide an approximation of the total budget using the central limit theorem (CLT). Finally, Gopi et al. (2021); Koskela et al. (2020), inspired by Sommer et al. (2019), are able to compute the exact budget numerically up to arbitrary precision by aggregating the privacy loss random variable with the fast Fourier transform.
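As a rough illustration of Rényi DP accounting, the sketch below composes the Gaussian mechanism over several steps and converts the result to (ϵ, δ)-DP. It assumes sensitivity 1 and ignores the privacy amplification from subsampling that practical DP-SGD accountants exploit, so its numbers are pessimistic; all function names are hypothetical.

```python
import math


def gaussian_rdp(alpha: float, sigma: float, steps: int) -> float:
    """Renyi DP of order alpha for `steps` Gaussian mechanisms (sensitivity 1 assumed)."""
    return steps * alpha / (2.0 * sigma ** 2)


def rdp_to_dp(rdp_eps: float, alpha: float, delta: float) -> float:
    """Convert an (alpha, rdp_eps)-RDP guarantee into an epsilon for (epsilon, delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)


def best_epsilon(sigma: float, steps: int, delta: float, alphas: list[float]) -> float:
    """Pick the order alpha that yields the smallest epsilon."""
    return min(rdp_to_dp(gaussian_rdp(a, sigma, steps), a, delta) for a in alphas)


if __name__ == "__main__":
    orders = [1.0 + x / 10.0 for x in range(1, 1000)]
    print(best_epsilon(sigma=10.0, steps=100, delta=1e-5, alphas=orders))
```

Because Rényi divergences of a fixed order add up under composition, tracking a grid of orders and converting once at the end gives a tighter bound than composing (ϵ, δ) guarantees step by step.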

Approach
We consider the following scenario to motivate our approach: an entity wants to implement NLP pipelines to gain insights from internal data, e.g., emails from customers. To seek advice and get support for modeling the data and building pipelines, the entity aims to share an excerpt of the internal data with a third party such as a consultant or a group of researchers. In order to do this without compromising the privacy of its customers, the aim is to synthesize a verifiably private "toy" dataset that reflects the properties of the original data without leaking private information. On such a toy dataset, a third party could research how to best solve the task at hand and train a model to perform inference on the actual internal data, without being able to access sensitive information about customers. Formally, we aim to achieve the following goal: We consider a dataset consisting of a training set $D_{\text{train}}$ and test set $D_{\text{test}}$. Given $D_{\text{train}}$ or a subset of it, we want to train a generative model to synthesize a dataset $\tilde{D}_{\text{train}}$ that does not leak information from the original $D_{\text{train}}$. Furthermore, the synthesized dataset should share statistical properties with the original one so that a classification model trained on $\tilde{D}_{\text{train}}$ performs as well as if it were trained on $D_{\text{train}}$ when making predictions about $D_{\text{test}}$.
To achieve this, we use the pretrained autoregressive transformer model (Vaswani et al., 2017) GPT-2 (Radford et al., 2019) and use natural language prompts to enable the conditional generation of text based on desired textual attributes such as its sentiment, domain or genre provided in the prompt. Furthermore, we introduce a new training objective that penalizes the generation of samples fitting another label to reduce the risk of incorrectly labeled samples in our synthetic dataset. Finally, we fine-tune our model using a differentially private optimizer to provide privacy guarantees for our training data and to prevent information leakage from our model when subsequently sampling our synthetic dataset.

Conditional text generation with natural language prompts
As we want to control specific textual attributes of our synthetic data, we need to train our model in a manner that allows us to generate different types of texts corresponding to the desired attributes or labels present in our dataset. We consider a text sample to correspond to a set of $M$ attributes of interest, namely $A := \{a_1, a_2, \dots, a_M\}$, where each attribute $a_j$ can take on values from a set of categorical values $C_j$. In the case of product reviews, $a_1$ could be the sentiment of a review that can take on the values $a_1 \in C_1 = \{\text{Positive}, \text{Negative}\}$ and $a_2$ can be the product category, so that $a_2 \in C_2 = \{\text{Books}, \text{Electronics}, \text{DVD}, \text{Kitchen}\}$. Our goal is to learn a model $p(x \mid a_1, \dots, a_M)$ in order to controllably synthesize text samples according to our desired attributes.
A straightforward approach to realize this would be to train a separate generative model for each possible combination of attribute values. This approach is, however, highly memory-intensive, as it requires us to store the weights of a large number of models that grows exponentially with the number of categorical attributes. Following recent work (Schick and Schütze, 2021a), we therefore train a single language model to conditionally generate texts based on task instructions. Beyond reducing our memory needs, this approach allows us to leverage our model's pretraining knowledge and to perform text generation with only very few training samples (Schick and Schütze, 2021a). Our instructions $i(a_1, \dots, a_M)$ are formed using a template with placeholders that are filled out with verbalizations $v(a_j)$ taking on different forms for different values of every attribute $a_j$. An example of such an instruction template is visualized in Figure 2.

Write a [sentiment] review about a [product]:
During the training stage, we use a differentially private optimizer to fine-tune our language model to generate each text sample within the original dataset based on the prompt corresponding to its individual attributes. Subsequently, we can synthesize a new dataset by controllably sampling text based on our desired attributes passed in the prompt. To generate a private "twin" dataset, one might use the same distribution of textual attributes as in the original dataset. Alternatively, the instruction-based approach allows us to control and change such ratios, for instance if we desire to debias our original data.
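The template mechanism can be sketched as follows. The template string mirrors the one shown in Figure 2, while the verbalizations (for example "gadget" for the Electronics category) and all function names are assumptions made for illustration; the paper does not list the exact wording used.

```python
import random
from collections import Counter

TEMPLATE = "Write a {sentiment} review about a {product}:"

# Hypothetical verbalizations v(a_j) for each attribute value.
VERBALIZATIONS = {
    "sentiment": {"Positive": "positive", "Negative": "negative"},
    "product": {"Books": "book", "Electronics": "gadget"},
}


def build_instruction(attributes: dict[str, str]) -> str:
    """Fill the template placeholders with the verbalized attribute values."""
    filled = {name: VERBALIZATIONS[name][value] for name, value in attributes.items()}
    return TEMPLATE.format(**filled)


def sample_prompts(original_attributes: list[dict[str, str]], n: int, rng: random.Random) -> list[str]:
    """Draw n prompts that follow the attribute distribution of the original data."""
    counts = Counter(tuple(sorted(a.items())) for a in original_attributes)
    combos = list(counts)
    drawn = rng.choices(combos, weights=[counts[c] for c in combos], k=n)
    return [build_instruction(dict(c)) for c in drawn]


if __name__ == "__main__":
    data = [
        {"sentiment": "Positive", "product": "Books"},
        {"sentiment": "Negative", "product": "Electronics"},
    ]
    for prompt in sample_prompts(data, 4, random.Random(0)):
        print(prompt)
```

Changing the weights passed to `rng.choices` is one way to alter the attribute ratios of the synthetic dataset, as discussed above for debiasing.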

Reducing faulty labels with prompt-mismatch objective
The standard training objective for autoregressive language modeling is to minimize the negative log-likelihood (NLL) of every token given its previous tokens. We incorporate the natural language instructions (Radford et al., 2019; Brown et al., 2020) into this training objective. For every text sequence $x$ and its corresponding attribute values $a := (a_1, \dots, a_M)$, we construct the concatenated sequence $i(a) \oplus x$, which prepends a corresponding task instruction to each text sample. Let $L$ denote the length of this concatenated sequence and let $w_l$ be the sequence's $l$-th token. Our NLL loss is now
$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{l=1}^{L} \log p(w_l \mid w_1, \dots, w_{l-1}). \tag{2}$$
This objective encourages the model to generate correct samples for a given instruction. However, it does not minimize the likelihood of generating wrong samples corresponding to another prompt and therefore another attribute. This is particularly unfavorable for our goal of generating synthetic training datasets, as every generated text with an error of this kind corresponds to a wrongly labeled training sample. To address this, we extend the training objective with a term penalizing the generation of a given sample for a wrong prompt. Specifically, let $I_{\text{wrong}}$ denote the set of all prompts not matching the given attribute values $a_1, \dots, a_M$. The overall training loss we aim to minimize combines the NLL loss with this penalty term, where λ is the hyperparameter that balances the two losses. Note that in practice, when the number of possible labels is high, this computation might be inefficient and the objective too complex for the model to realize. In this case, one might randomly sample a few class labels for the wrong prompt in every training batch or penalize the generation for class labels that are the most similar to the correct one.
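One possible instantiation of such a combined objective is sketched below. The exact form of the penalty is an assumption: here it is the log-likelihood of the sample under each wrong prompt, averaged over the wrong prompts and weighted by λ. The per-token log-probabilities would come from the language model; in this sketch they are passed in as plain lists, and all function names are hypothetical.

```python
import math


def nll(token_log_probs: list[float]) -> float:
    """Negative log-likelihood of a sequence from its per-token log-probabilities."""
    return -sum(token_log_probs)


def prompt_mismatch_loss(
    correct_log_probs: list[float],
    wrong_log_probs: list[list[float]],
    lam: float,
) -> float:
    """NLL under the matching prompt plus lam times the mean log-likelihood under wrong prompts.

    Assumed form: minimizing the second term pushes down the likelihood of
    generating the same text when conditioned on a non-matching prompt.
    """
    penalty = 0.0
    if wrong_log_probs:
        penalty = sum(sum(lp) for lp in wrong_log_probs) / len(wrong_log_probs)
    return nll(correct_log_probs) + lam * penalty


if __name__ == "__main__":
    correct = [math.log(0.5), math.log(0.5)]
    wrong = [[math.log(0.25), math.log(0.25)]]
    print(prompt_mismatch_loss(correct, wrong, lam=0.2))
```

Subsampling the wrong prompts per batch, as suggested above for large label sets, would correspond to passing only a few entries in `wrong_log_probs`.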

Evaluation
We conduct an extensive evaluation measuring the utility and privacy of our generated data as well as the quality of its language. In this section, we describe the datasets we use as well as our evaluation metrics, implementation details and results.

Datasets
We use two publicly available datasets that are widely used for evaluating the performance of text classification models:

IMDb Movie Reviews
The IMDb movie review dataset consists of movie reviews written by various authors. We use the two binary sentiment labels as attributes to condition our model on and use a random subset of 5,000 reviews each for training and evaluation.

Amazon Multi Domain Reviews
The Amazon multi domain review dataset was introduced by Blitzer et al. (2007) and consists of two thousand product reviews from each of the four product categories books, DVDs, electronics and kitchen appliances. Both binarized sentiment labels and the product categories books and electronics serve as attributes we consider. Our resulting dataset consists of 3,000 training samples and 1,000 test samples.

Implementation Details
We implement and train our language models using the PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020) libraries and the 1.5B parameter implementation of GPT-2 (Radford et al., 2019). To fine-tune the language models, we employ the "privacy engine" of the private-transformers package by Li et al. (2021). In line with their experiments, we also use DP-Adam (Dong et al., 2019; Bu et al., 2020), a differentially private version of the Adam (Kingma and Ba, 2015) optimizer. The privacy engine allows us to specify desired target privacy parameters ϵ and δ, from which the standard deviation parameter σ for the Gaussian mechanism is derived using either Rényi DP (Mironov, 2017), the CLT (Dong et al., 2019), or the FFT accountant (Gopi et al., 2021) in our experiments. To obtain reliable results for training our generative models on small subsets of the training samples, we sample three random subsets for every size and report averaged results from these three experimental runs. We trained GPT-2 for five epochs when using a differentially private optimizer and merely two epochs when using a non-private optimizer, as the latter tended to overfit quickly on the small training set. To further mitigate this, a smaller learning rate turned out to be more effective for non-private optimization: While we used a learning rate of 8e-6 with DP-Adam, we obtained the best results for non-private optimization with a learning rate of 5e-7. Lastly, we chose the hyperparameter λ := 0.2. We generated our synthetic datasets using the original distribution of sentiment and product category attributes, which was 50 / 50 in all cases. To sample from GPT-2, we use nucleus sampling (Holtzman et al., 2020) with p = 0.8 across all experiments. Our results were not obtained through an extensive hyperparameter search but through educated guesses over a couple of iterations to avoid large computational effort. All experiments were performed using an NVIDIA Tesla A100 GPU. With this setup, a training epoch over 1,000 text samples took approximately five minutes.
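For readers unfamiliar with nucleus sampling, the following toy sketch shows the filtering step for a single next-token distribution with p = 0.8. It operates on a small hand-written dictionary of token probabilities rather than on GPT-2 outputs, and the function names are hypothetical.

```python
import random


def nucleus_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches p, then renormalize."""
    kept: dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept.items()}


def sample_token(probs: dict[str, float], p: float, rng: random.Random) -> str:
    """Sample one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    next_token_probs = {"good": 0.5, "great": 0.25, "bad": 0.15, "awful": 0.1}
    rng = random.Random(0)
    print(nucleus_filter(next_token_probs, 0.8))
    print([sample_token(next_token_probs, 0.8, rng) for _ in range(5)])
```

Tokens in the low-probability tail are never sampled, which tends to reduce incoherent continuations while keeping more diversity than greedy decoding.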

Experimental Results
As stated previously, we aim to synthesize datasets that (1) reflect properties of the original data and can be used to train classifiers that perform similarly to those trained on the original data, (2) are private and do not leak information from the original data and (3) are diverse and of high language quality. Accordingly, we perform experiments with metrics measuring these attributes and report our results in the following:

Data Utility
To measure the utility of our datasets, we train classification models for each attribute on both our original data and the generated data and compare their performances when evaluating them on our real test data. Ideally, our anonymized twin datasets should lead to classifiers that are as accurate as those trained on our original data. To account for various settings including those with computational constraints, we train a shallow support vector machine classifier based on Tf-idf encodings as well as a deep BERT (Devlin et al., 2019) based classifier with 110M parameters. Furthermore, as an interesting baseline, we evaluate the performance of BERT trained on real data with differentially private optimization using the code from Li et al. (2021). The classification accuracies for models trained on synthetic data are shown in Table 1 and classification accuracies for models trained on real data are shown in Table 2. In the following, we summarize our key findings:
Synthetic data is almost on par with real data: The performance of classification models trained on generated data only drops minimally compared to those trained on real data. Across all three classification tasks, for both datasets with privacy budgets of ϵ = 3 and ϵ = 8, the accuracy of BERT and Tf-idf based models is never more than 3% below the accuracy obtained for real data when given access to all training samples (5,000 and 3,000 for IMDb and Amazon, respectively).
Classifiers trained on synthetic data outperform private classifiers trained on real data: Notably, when comparing the results of transformer-based classifiers trained on our synthetic data (Table 1) to those trained on real data with differentially private optimization (Table 2), we find that the former substantially outperform the latter across all tasks for both ϵ = 3 and ϵ = 8. This raises the question of whether the intermediate step of private data generation should always be performed rather than training classifiers with DP-SGD.
Private data generation shows high utility in few-shot settings: Lastly, even when given as few as 25 or 50 samples, GPT-2 can generate datasets that lead to high-performing classifiers, which can most likely be attributed to the utilization of pretraining knowledge through our prompting techniques. Therefore, beyond the anonymization of existing datasets, our method can be used to enlarge existing small datasets in a private manner.

Data Privacy
To the best of our knowledge, measuring the privacy of textual data is an active area of research (Carlini et al., 2021; Brown et al., 2022) and there is no standardized and agreed-upon way to do so. In our experiments, we follow Carlini et al. (2021) by counting the number of instances in which our synthetic dataset contains samples that are extremely close to a sample from the training data and can therefore be considered a duplicate: For every sample $x_i$ from our training data used for the language model and every $x_j \in \tilde{D}_{\text{train}}$, we compute the sets of trigrams $g_3(x_i)$ and $g_3(x_j)$. We consider the two samples duplicates if these trigram sets overlap strongly. As we hypothesize that duplicates are relatively rare, we double the amount of generated data compared to our utility experiments and search for duplicates within 10,000 and 6,000 samples generated for the IMDb and Amazon datasets, respectively. Our results are depicted in Table 3 and demonstrate the significant reduction of data leakage from privately trained models.
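A trigram-based duplicate check of this kind can be sketched as follows. The threshold is an assumption modeled on the criterion of Carlini et al. (2021): two texts count as duplicates when the shared trigrams make up at least half of the smaller trigram set. Whitespace tokenization, lowercasing and the function names are likewise illustrative choices, not details taken from the paper.

```python
def trigrams(text: str) -> set[tuple[str, ...]]:
    """Set of word trigrams of a text (whitespace tokenization assumed)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def is_duplicate(train_text: str, generated_text: str, threshold: float = 0.5) -> bool:
    """Assumed criterion: shared trigrams cover at least `threshold` of the smaller set."""
    g_train, g_gen = trigrams(train_text), trigrams(generated_text)
    if not g_train or not g_gen:
        return False
    overlap = len(g_train & g_gen)
    return overlap >= threshold * min(len(g_train), len(g_gen))


def count_duplicates(train_texts: list[str], generated_texts: list[str], threshold: float = 0.5) -> int:
    """Number of generated samples that duplicate at least one training sample."""
    return sum(
        any(is_duplicate(t, g, threshold) for t in train_texts)
        for g in generated_texts
    )


if __name__ == "__main__":
    train = ["this movie was a complete waste of my time and money"]
    generated = [
        "this movie was a complete waste of my time honestly",
        "a charming story with wonderful characters and a great score",
    ]
    print(count_duplicates(train, generated))
```

The pairwise comparison is quadratic in the dataset sizes; for the sample counts used here this is manageable, but an inverted index over trigrams would scale better.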

Language Quality
As a metric measuring the quality of our generated samples, we use the Mauve (Pillutla et al., 2021) score to compute the similarity of the distributions of $D_{\text{train}}$ and the generated data $\tilde{D}_{\text{train}}$ from every trained model. As can be seen in Table 4, higher ϵ values tend to increase the quality measured by Mauve, but the differences do not appear to be substantial. As a reference, the Mauve scores computed when comparing $D_{\text{train}}$ and $D_{\text{test}}$ are 0.95 for IMDb and 0.94 for Amazon. Based on manual inspection, the quality of generated texts seems to be very high. Mismatches between prompts and generated texts (e.g. a negative review generated for a positive prompt) as well as incoherent generations do occur, but very rarely. Excerpts of the generated data can be seen in Table 5; failure cases can be found in Tables 6 and 7.

Related Work

Text Anonymization
Substantial efforts have been made to enable the privacy-preserving processing of textual data, both through private textual vector representations and by transforming text into readable anonymous formats. Approaches from the former category either obtain term frequency vectors using differentially private mechanisms (Weggenmann and Kerschbaum, 2018; Fernandes et al., 2019) or use deep learning methods with adversarial training objectives (Coavoux et al., 2018). In the work by Qu et al. (2021), various local DP mechanisms are explored to obtain private BERT representations.
Methods aiming at rewriting texts in a privacy-preserving manner range from rule-based approaches using human-engineered text perturbations (Mahmood et al., 2019; Bevendorff et al., 2019) and word replacements through the perturbation of individual word embeddings using differential privacy (Feyisetan et al., 2019b, 2020b) to deep learning based approaches leveraging sequence-to-sequence models. These sequence-to-sequence models can either incorporate adversarial objectives penalizing the generation of author-revealing information (Shetty et al., 2018; Xu et al., 2019) or integrate differential privacy in the text sampling process (Bo et al., 2021; Weggenmann et al., 2022; Mattern et al., 2022).
Notably, various papers proposing the integration of differentially private mechanisms in deep learning architectures (Krishna et al., 2021; Beigi et al., 2019a,b; Alnasser et al., 2021) have been shown to actually violate differential privacy (Habernal, 2021, 2022). While these works still represent important contributions due to their good empirical results, it should be noted that the design of NLP systems with DP guarantees is a task that is prone to errors and should be approached carefully.

Differentially Private Language Model Training
As generative language models have been shown to leak training data (Carlini et al., 2021) and the embeddings of discriminative models have been shown to contain sensitive information about a text's originator (Song and Raghunathan, 2020), differentially private optimizers such as DP-SGD (Song et al., 2013b; Bassily et al., 2014b) and DP-Adam (Abadi et al., 2016b; Kingma and Ba, 2014) have been applied to a variety of NLP tasks. Large-scale pretraining of BERT using DP-SGD has been shown to achieve masked language modeling performance comparable to non-private BERT (Anil et al., 2021). For the tasks of text classification and named entity recognition, good performance has been obtained with BERT and DP-SGD, but only with large privacy budgets of ϵ = 100 or higher.
Recently, it has been demonstrated that with the correct choice of hyperparameters and fine-tuning objectives aligned with the pretraining procedure, both generative and discriminative language models can achieve high performance in various tasks even with stricter privacy bounds (Li et al., 2021; Yu et al., 2022). An active area of research is concerned with the empirical evaluation of a language model's privacy (Brown et al., 2022) using methods such as membership inference attacks (Hayes et al., 2019).

Conclusion
In this paper, we explored the generation of synthetic datasets from differentially private language models as a solution for publicly sharing textual data while protecting the privacy of users whose data is being shared. Our experiments show that synthetic data from differentially private language models is of high quality and is very well suited as training data for further tasks while significantly reducing the risk of leaking the original data. Our approach can be applied in a variety of use cases working with sensitive data. An interesting challenge for future work is the anonymization of multimodal datasets consisting of tabular, visual and text data.

Limitations
Privacy Guarantee
While differential privacy provides a statistical privacy guarantee, one cannot be certain that a differentially private language model does not leak any sensitive information. As seen in our experiments, the differentially private models did leak some of their training data, even if significantly less than the non-private ones. This can be a concern when dealing with training data containing names, telephone numbers or even passwords.

Synthetic Data Quality
As shown in Tables 6 and 7, our models did in rare cases produce incoherent language or text samples that did not fit the desired control attributes. This can limit the quality of the generated data.

Limits of Controllable Generation
The controllability of multiple fine-grained textual attributes in text generation remains a difficult challenge (Lyu et al., 2021). We therefore need to assume that our approach will become less accurate the more textual attributes we want to consider.

Ethical Considerations
Data privacy is a highly important issue for the responsible deployment of machine learning solutions. With our work, we directly contribute to this field of research.
As our method relies on large pretrained language models, it should be noted that users deploying these technologies need to be aware of their undesirable, human-like biases (Sheng et al., 2019; Abid et al., 2021). Methods for reducing these harmful associations are actively being developed by the research community (Liang et al., 2021; Schick et al., 2021).

I have read and enjoyed many of your books, but I can't recommend this one highly enough. I am not a big fan of vampires. They don't interest me at all This book, however, did. It was a fun, entertaining read. I loved the characters. I really didn't care if I was going to be able to go back in time to help save the future, or if I would be in for the long haul. There was a lot of tension throughout the book. I was glad I was able to finish it. I'm glad that I was not as upset with it as many others were. I gave this book a solid 5 stars.
The screen has a nice clear picture and has great picture quality, the speakers are good, the microphone is not great, the battery is adequate, and it's nice and light. This is the first of the Vivid series I have bought, and I'm glad I got one.
The device itself, while very good, is bad. It seems like the software doesn't like to write out its own ID (it should) so the app doesn't work. I've seen other users say the same thing and have had no problems. I haven't bought another bluetooth device, but I'm pretty sure it has something to do with it.
25 8 Category=Books Sentiment=Positive: I really enjoyed this book. I didn't get the full explanation as I am only a beginner but for beginners with the basic knowledge of a math and physics the theory is very sound. 2.0 stars.
4,000 ∞ Category=Books Sentiment=Negative: This book is a compilation of some of my personal thoughts and ideas. The author's goal was to make me believe in witchcraft, but he completely failed.
50 8 Category=Electronics Sentiment=Positive: This is the best little portable guitar i've bought since the little portable juke box on a bike. The high quality quality quality of this product is great. I do wish i had bought a sooner the old jukebox but once you get it together its really nice to have.

Figure 2: Our template-based approach for generating task instructions. A template consists of placeholders for verbalizations of different attribute values.
for a book on how to be a good leader and found it by chance. It seems that no matter how bad you have been, how bad things have gotten, or how terrible a person you are, a leader can still turn it around.
a huge fan of the Star Wars films but I really enjoyed this one. It is the first time I've ever seen the Star Trek movies and I enjoyed them as well. The plot is a bit different then you'd expect from a Star Trek story but I enjoyed it all the same.
brand new LG Electronics Vivid 5, and so far I'm very happy.

Table 1: Accuracy of classification models trained on synthetic data.

Table 2: Accuracy of classification models trained on real data.

Table 3: Number of duplicates from the training data generated by language models.

Table 5: Excerpts of the generated data from models trained with only 25 text samples. The texts were selected by picking a random sample from $D_{\text{test}}$ and finding the most similar one according to Sentence-BERT (Reimers and Gurevych, 2019) within the generated data.
Original: I'm not any expert in the computer world. But this product it is very easy to install. The range is very strong. This is a good piece of equipment. It is small, has a high quality and a reasonable price. I used it to test the sound from my computer monitor. I am happy with the sound.
ϵ = ∞: I got this game to play on my PSP as a little side project. I'm quite happy with it. The graphics are pretty good and the story line is pretty entertaining, as well.
Original: Good Lord, people must be busier than I thought. If you haven't been allowed to even think about your future after retirement, then maybe you will get something out of this book. For the rest of us, this is drivel
ϵ = 3: I have been looking for an awesome, fun, easy to read book, like a classic from when I was a kid. I am very disappointed. It is not a fun read. It really is a bore. I have read the sequel, and am also disappointed. This is not one of those things that you read to pass the time. Not a good choice for a first book.
ϵ = 8: I would've done better with another book I read. The plot is interesting but the characters are bland and the setting is really only a backdrop to the action and plot. It's a shame as I enjoyed the book, but this isn't a good read. This book was boring, boring, and boring. I have been thinking about getting a new copy of this book ever since I read it, but this one didn't work for me at all. Not a bad idea, just not my cup of tea.

Table 6: Failure cases in which the generated text does not fit the desired attributes. Model mistakes are marked in red.

Table 7: Failure cases in which the generated text is incoherent or does not make sense. Model mistakes are marked in red.