Nano: Nested Human-in-the-Loop Reward Learning for Few-shot Language Model Control

Pretrained language models have demonstrated extraordinary capabilities in language generation. However, real-world tasks often require controlling the distribution of generated text in order to mitigate bias, promote fairness, and achieve personalization. Existing techniques for controlling the distribution of generated text only work with quantified distributions, which require pre-defined categories, proportions of the distribution, or an existing corpus following the desired distributions. However, many important distributions, such as personal preferences, are unquantified. In this work, we tackle the problem of generating text following arbitrary distributions (quantified and unquantified) by proposing Nano, a few-shot human-in-the-loop training algorithm that continuously learns from human feedback. Nano achieves state-of-the-art results on single-topic/attribute control as well as quantified distribution control compared to previous works. We also show that Nano is able to learn unquantified distributions, achieves personalization, and captures differences between different individuals' personal preferences with high sample efficiency.


Introduction
Recent developments in large language models (Radford et al., 2019; Brown et al., 2020) have advanced the state of automated text generation. However, to apply them to real-world tasks, it has become increasingly desirable to reduce the social bias exhibited by large language models (Bender et al., 2021), improve fairness (Baldini et al., 2022), and fit diverse individual preferences (Xue et al., 2009). These desired properties are defined only over a set of generated texts rather than over individual sentences. Therefore, they require control over the distribution of generated text (Khalifa et al., 2021).

Table 1: Comparison between personalized generation from NANO vs. GPT-2. Our model is able to capture personal preferences with few-shot learning.

Our personalized generation: "The hotel is very awesome because it is located in a great neighborhood accessible to the rest of the city."
GPT-2 (Radford et al., 2019): "The hotel is very awesome because I always feel like I can get a better experience."
Existing works on distribution control deal with quantified distributions: they require knowledge of a known number of categories associated with each data point, an existing corpus following the desired distribution (Gao et al., 2022; Wang et al., 2018; Li and Tuzhilin, 2019), or a well-defined distribution with known proportions (Khalifa et al., 2021) (such as x% category A, y% category B, etc.). However, unquantified distributions, such as arbitrary subjective distributions (e.g. "news I find surprising" for an arbitrary person), are relatively understudied. Because many distributions, including personal preferences, are fundamentally unquantified a priori, the ability to learn unquantified distributions in a few-shot manner is key to modeling these distributions.
Our key insight for tackling arbitrary distributions is to continuously learn from intermediate human feedback, which points us in the right direction at every step, instead of learning the final categories in one step. To this end, we propose Nested Human-in-the-Loop Reward Learning (NANO), a few-shot controllable text generation algorithm with two nested loops: the outer loop is a cycle of three learning phases (generation, human feedback, and training), and we introduce an inner loop in the generation phase, where we perform a tree search with nodes sampled from a language model, to address the lack of samples. Furthermore, we find that human-in-the-loop training not only enables learning unquantified distributions, but also improves performance on quantified distributions. Our contributions are summarized as follows:
• We introduce a human-in-the-loop reward learning algorithm that learns to generate text following arbitrary distributions through human feedback. We demonstrate that our method works for all of the following types of distributions: single-topic/attribute, quantified distributions, and unquantified distributions.
• We show that NANO is able to learn unquantified distributions, successfully achieves personalization, and captures differences between different individuals' personal preferences with only 64 labels from each person (RQ1).
• We achieve state-of-the-art results on controlling quantified distributions (RQ2) as well as single-topic/attribute generation (RQ3) compared to previous works, while using only few-shot samples.
• Through ablation studies, we demonstrate the necessity of multi-iteration human feedback for high sample efficiency (RQ4) and justify our architecture's design choices (RQ5). We also show that our method extends to newer and larger language models than GPT-2.
An illustration of our method is shown in Figure 1, and a comparison of NANO's capabilities to previous works is provided in Table 2.

Related Work
Text generation models are models designed to generate natural language. Natural language generation tasks include prompt completion, text summarization, translation, style transfer, etc. Current state-of-the-art language models include large transformer-based models, such as GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020). These models are pre-trained on large corpora of text with next-token prediction, and can be easily finetuned to perform various text generation tasks as well as classification tasks. GPT-Neo (Gao et al., 2020) is one version of GPT that is specifically designed to allow few-shot learning of tasks. Recent advancements also allow text generation models to follow text instructions, such as InstructGPT (Ouyang et al., 2022). Before transformer-based models, natural language generation via template-based methods or hand-coded grammar-based systems (Gatt and Krahmer, 2018) was also explored. In our paper, we use GPT-2 (355M) as our baseline model.
Controllable text generation refers to techniques for generating text in a controllable fashion. Previous works have aimed to control generation towards specific topics or attributes (including classifier-based approaches (Dathathri et al., 2020) and reinforcement learning based approaches (Lu et al., 2022)) and to control the style of generated text via style transfer (including statistical NLP methods (Hovy, 1987; Xu et al., 2012), neural generative models (Prabhumoye et al., 2018; Lample et al., 2019; He et al., 2020), Retrieve-and-Edit approaches (Li et al., 2018; Hashimoto et al., 2018; Guu et al., 2018; Sudhakar et al., 2019; Madaan et al., 2020), and Transformer-based approaches (Lyu et al., 2021)). GDC (Khalifa et al., 2021) formulated distribution control as a constraint satisfaction problem where the model is optimized towards a quantified distribution. Our approach can not only generate text following quantified distributions, but also control generation towards unquantified distributions, which cannot be specified with numerical proportions. In the context of alleviating text degeneration, Welleck et al. (2020) proposed the unlikelihood loss to reduce the likelihood of unwanted continuations, which also serves as the motivation underlying our complementary loss. However, instead of pushing the output away from the unwanted token (Welleck et al., 2020), the complementary loss optimizes towards the remaining tokens and preserves the original probabilities that the language model assigns to them.
Human-in-the-loop (HITL) machine learning involves training or improving machine learning models with human feedback. Previous works on HITL in NLP (Wu et al., 2021a) utilize HITL to improve text classification (Arous et al., 2021; Karmakharm et al., 2019), semantic parsing (Yao et al., 2019a,b), text summarization (Stiennon et al., 2020; Ziegler et al., 2019), dialog and question answering (Hancock et al., 2019; Wallace et al., 2019), and sentiment analysis (Liu et al., 2021). HITL is also widely used in text generation evaluation (Khashabi et al., 2021). In this work, we use HITL training as a part of the training process. While many existing HITL works require humans to write or rewrite sentences, our approach only requires humans to provide ratings, which is easier to perform.
Fairness of text generation. Unconditional language models have been shown to perpetuate undesirable stereotypes during generation which disproportionately harm underrepresented social groups (Liang et al., 2020; Ravfogel et al., 2020; Sheng et al., 2019, 2020). Previous works in natural language generation have attempted to mitigate bias through pretraining regularization (Bordia and Bowman, 2019), distributional policy gradients (Khalifa et al., 2021), and performing additional edits after generation (Liang et al., 2021; Lyu et al., 2021). In comparison, our approach utilizes human feedback to gradually refine the distribution towards the target, allowing for fair generation by training from only self-generated samples.
Personalization of text generation is generating text following personal preferences, habits, or views. Previous works in personalization of text generation include GAN and frequent n-gram analysis (Yuan and Huang, 2019), personalized social media generation (Gao et al., 2022; Wang et al., 2018), personalized review generation (Li and Tuzhilin, 2019), and personalized dialog response generation (Wu et al., 2021b), which are specific to their respective domains of text and require an existing in-domain corpus to finetune the model. Our approach achieves personalization within a few iterations of human-in-the-loop training without the need for a large existing corpus, and is thus more flexible for domains lacking existing corpora.
Reinforcement learning in natural language processing has shown promising results in previous works on tasks including dialog generation (Li et al., 2016; Yang et al., 2020; Zhao et al., 2019), question answering (Godin et al., 2019; Chali et al., 2015), summarization and paraphrasing (Li et al., 2017; Xu and Zhang, 2021; Alomari et al., 2022), and controllable text generation (Khalifa et al., 2021; Lu et al., 2022). Lu et al. (2022) proposed iterative reinforcement learning from an external classifier. In comparison, our method trains the classifier along with the language model to bootstrap from a pretrained LM without any additional data or model. Monte Carlo Tree Search (Coulom, 2006) was proposed in the context of minimax games and resembles our tree search generation method. However, instead of backpropagating node values, we update model states from a critic network (Lillicrap et al., 2015) and resample from the model to obtain the next expansion node.

NANO
In general, controllable text generation operates on either an existing corpus of training examples or a description of the desired attribute or distribution. In this work, however, we adopt an active learning paradigm wherein humans can guide the model towards the desired attribute or distribution, allowing controlled generation with a minimum of manually written examples.
The outer loop of NANO is a "generate-feedback-train" loop. In each iteration of the loop, a number of samples are generated from what the model has learned so far (i.e. the model approximates P(x_{m+1:n} | a, x_{1:m}) as closely as possible). The generated samples are given to a human annotator, who rates the samples according to how accurately each conforms to the desired attribute or distribution. In addition, the human annotator can manually add new samples when the dataset lacks satisfactory samples. We keep the number of manually added samples to a minimum (with a maximum of 5 added samples) while significantly reducing the number of rated samples, in order to demonstrate our method's ability to self-improve with little human effort. Finally, the model is trained on the labeled dataset, and the trained model is used for generating text in the next iteration. In the following subsections, we detail each component of the outer loop.
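To make the cycle concrete, the following sketch shows one way the "generate-feedback-train" loop could be organized. It is a minimal illustration rather than the released implementation: generate_samples, collect_human_ratings, and train_on_labels are hypothetical helpers standing in for the three phases detailed below.

```python
# Minimal sketch of NANO's outer loop (hypothetical helpers, not the authors' code).
dataset = []  # accumulated (sentence, rating) pairs across all iterations

for iteration in range(num_iterations):
    # Phase 1 (Generation): sample sentences from the current model; after the first
    # iteration the critic guides generation as described in the Generation subsection.
    critic_or_none = critic if iteration > 0 else None
    samples = generate_samples(language_model, critic_or_none, prompt, n=16)

    # Phase 2 (Human feedback): the annotator rates each sample on a 1..2*nu-1 scale
    # and may add at most a handful of manually written sentences.
    dataset.extend(collect_human_ratings(samples))

    # Phase 3 (Training): fine-tune both the language model and the critic on all
    # labeled data collected so far, then reuse them in the next iteration.
    language_model, critic = train_on_labels(language_model, critic, dataset)
```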

Generation
Consider the output space of a language model as a search tree. Each unique output sequence corresponds to a path from the root to a leaf, where each node is a token. One could sample from the root downwards with the probability of choosing each child node prescribed by the language model. During early iterations, however, the language model does not have enough data to accurately generate the target probabilities. Alternatively, one could search for an optimal path at the cost of output diversity and naturalness. To incorporate the advantages of both methods, we perform a tree search with critic updates. We use a generative language model and a critic network to guide language generation: at each step, the sentence is sampled to the end, a soft loss and a hard loss for the whole sentence are extracted from the critic network, and the soft loss is backpropagated to update the hidden key-value pairs in the language model (Dathathri et al., 2020). The critic network is trained from human labels (except in the first iteration, where we only use the language model for generation) and takes full-sentence output embeddings (for the soft loss) or full-sentence output tokens (for the hard loss) from the language model as input. The partially generated sentence is unrolled forward k times using the token probabilities from the language model. After obtaining k sentences, the next token with minimum hard loss is selected. An overview of the generation process is shown in Figure 2, and a detailed generation algorithm is provided in Algorithm 1.
It is important to note that the language model and the critic network need to share the same token embedding table, as the critic network takes the language model's output embeddings as input. A simple solution is to initialize both networks from a pretrained, autoregressive language model and freeze the embedding table throughout the training steps.
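As a rough illustration of the inner loop, the sketch below follows one plausible reading of the procedure above: unroll the partial sentence to full length several times, use the critic's soft loss to nudge the cached key-value pairs (as in PPLM), and commit the next token from the rollout with the lowest hard loss. Here sample_to_end, critic.soft_loss, and critic.hard_loss are hypothetical helpers, and the tensor plumbing (e.g. treating the key-value cache as a flat list of tensors) is heavily simplified.

```python
import torch

def guided_next_token(lm, critic, tokens, kv_cache, k=4, max_len=50, step_size=0.02):
    """Pick the next token by unrolling k candidate continuations and scoring them
    with the critic. Sketch only; not the authors' implementation."""
    candidates = []
    for _ in range(k):
        # Roll the partial sentence out to max_len using the LM's own probabilities,
        # starting from the (possibly perturbed) key-value history.
        rollout_tokens, rollout_embeds = sample_to_end(lm, tokens, kv_cache, max_len)

        soft_loss = critic.soft_loss(rollout_embeds)   # differentiable, on output embeddings
        hard_loss = critic.hard_loss(rollout_tokens)   # non-differentiable, on sampled tokens

        # PPLM-style update: push the cached key-value pairs towards lower soft loss.
        grads = torch.autograd.grad(soft_loss, kv_cache)
        kv_cache = [(kv - step_size * g).detach().requires_grad_(True)
                    for kv, g in zip(kv_cache, grads)]

        candidates.append((hard_loss, rollout_tokens))

    # Commit only the next token of the best-scoring rollout; the surrounding
    # generation loop repeats this step until the sentence reaches max_len.
    best = min(candidates, key=lambda c: c[0])[1]
    return best[len(tokens)], kv_cache
```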

Human feedback
When collecting human feedback, each generated sentence x receives a rating r indicating how well x satisfies the desired attribute or distribution (higher scores indicate that similar sentences should occur more often, and lower scores indicate that similar sentences should occur less often). In order to provide a simple human interface, we consider ratings to be discrete integers from 1 to 2ν − 1 for some integer constant ν > 1; ratings from 1 to ν − 1 indicate a negative rating, rating ν indicates a neutral rating, and ratings from ν + 1 to 2ν − 1 indicate a positive rating. Each pair (x, r) is added to the training set.
In addition to rating generated sentences, new sentences can be added to the training set when the attribute has a very low frequency in naturally generated text. A rating is provided along with the new sentence, and the pair (x, r) is then added to the training set.
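For instance, a 1-5 rating scale (i.e. ν = 3, as used in our experiments) maps to polarities as spelled out in the small illustration below.

```python
# Rating polarity for nu = 3, i.e. a 1-5 scale (illustration only).
nu = 3
for r in range(1, 2 * nu):  # ratings 1..5
    if r < nu:
        polarity = "negative"
    elif r == nu:
        polarity = "neutral"
    else:
        polarity = "positive"
    print(f"rating {r}: {polarity}")
```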

Training
At each iteration, both the language model and the critic network are initialized from pretrained GPT-2.

Training the generative language model
Language models have traditionally been trained with the negative log-likelihood (NLL) loss on positive labels. We augment the NLL loss with the complementary loss to incorporate both positive and negative labels. Given a sentence x and its rating label r, the language model L is fine-tuned as a generative model with the loss

ℓ_L(x, r) = −k Σ_i Σ_v q(v) log p_L(v | a, x_{1:i−1}).

The scaling factor k depends on the strength of the rating: k = |r − ν| / (ν − 1). The ground-truth distribution q(v) is an indicator function that peaks at v = x_i when the rating is positive; when the rating is negative, instead of discarding the sample or inverting the loss sign, we assign q(v) equal to the distribution p_L(v | a, x_{1:i−1}) predicted by the language model, after setting q(x_i) to 0 and renormalizing:

q(x_i) = 0, and q(v) = p_L(v | a, x_{1:i−1}) / (1 − p_L(x_i | a, x_{1:i−1})) for v ≠ x_i.

We emphasize the significance of not discarding samples with r < ν (i.e. negative samples). During early stages, when the model's generations are poor, discarding negative samples leaves the language model trained on only a few positive samples, leading to less training signal and lower generation quality. Another straightforward solution is to descend on the predicted word when given a negative label instead of ascending on the remaining words. However, this method tends to destroy information in the language model, causing it to no longer output fluent sentences at all.
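Under the definitions above, a per-token implementation of this loss could look like the following sketch. It assumes the language model exposes log-probabilities over the vocabulary at each position; the shapes and the reconstructed equation reflect our reading of the description, not the authors' code.

```python
import torch
import torch.nn.functional as F

def lm_loss(log_probs, target_ids, rating, nu):
    """Complementary-loss sketch.
    log_probs:  (seq_len, vocab) log p_L(v | a, x_{1:i-1}) at each position
    target_ids: (seq_len,) the tokens x_i of the rated sentence
    rating:     integer in 1..2*nu-1
    """
    k = abs(rating - nu) / (nu - 1)          # strength of the rating
    if rating >= nu:
        # Positive (or neutral, where k = 0): standard NLL, i.e. q peaks at x_i.
        loss = F.nll_loss(log_probs, target_ids, reduction="sum")
    else:
        # Negative: q is the model's own distribution with the observed token zeroed
        # out and renormalized; ascend the remaining tokens instead of descending x_i.
        probs = log_probs.exp()
        q = probs.clone()
        q.scatter_(1, target_ids.unsqueeze(1), 0.0)   # q(x_i) = 0
        q = q / q.sum(dim=1, keepdim=True)            # renormalize
        loss = -(q.detach() * log_probs).sum()        # cross-entropy H(q, p_L)
    return k * loss
```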

Training the critic network
The critic network C is fine-tuned to assign a high loss to sentences with incorrect attributes, and a low loss otherwise. The attribute we use depends on the desired distribution:
Single-topic control. The simplest form of distribution is 100% on a single topic or attribute. In this case, a human label corresponds to the rating for this attribute. A straightforward method is to define the critic network as a (2ν − 2)-way classifier. However, this would lose the ordinal information of the classes. Instead, the classifier is augmented by interpreting the output score for each rating level t as the probability that the target rating should be greater than or equal to this rating level. Therefore, we define a rating loss for single-topic control as the sum of the binary classification losses at each possible rating level:

ℓ_C(x, r) = −Σ_{t=2}^{2ν−1} [ 1[r ≥ t] log p_C(rating ≥ t | x) + 1[r < t] log (1 − p_C(rating ≥ t | x)) ].

When generating, the soft and hard losses are the weighted sum of the losses at the positive ratings, for some weights w(ν + 1) < w(ν + 2) < ... < w(2ν − 1):

ℓ_C(x) = Σ_{t=ν+1}^{2ν−1} w(t) ℓ_C(x, t).

Distribution control. One of the most important goals of generation control is to control the distribution of topics. In particular, we would like to control the topic distribution from rating information alone, while allowing the human to fine-tune the distribution by rating one topic as more positive or negative than another. We found that the classifier used in single-topic control misleads the model into categorizing distributions by rating level. Instead, the critic is defined as a binary classifier, and the negative log-likelihood loss from the critic network is interpreted as the strength by which the language model should be pulled towards each point in the distribution. The critic network is trained on a weighted negative log-likelihood loss, ℓ_C(x, r) = −c log p_C(a | x), given a sentence x and its rating label r. The magnitude of the scaling factor c is determined by the rating strength, and its sign is determined by the rating polarity: c = (r − ν) / (ν − 1). When generating, the soft and hard losses are simply the losses at the maximum rating, i.e. ℓ_C(x) = ℓ_C(x, 2ν − 1).
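The two critic variants could be sketched as follows, again as a hedged reading of the description rather than the released implementation: the single-topic critic outputs one score per rating level (interpreted as the logit of p_C(rating ≥ t | x)), and the distribution-control critic is a plain binary classifier whose NLL is scaled by the signed rating strength c.

```python
import torch
import torch.nn.functional as F

def single_topic_critic_loss(level_logits, rating, nu):
    """Ordinal rating loss: level_logits has one logit per rating level t = 2..2*nu-1,
    each interpreted as the logit of p_C(rating >= t | x). Sketch only."""
    levels = torch.arange(2, 2 * nu, device=level_logits.device)  # t = 2..2*nu-1
    targets = (rating >= levels).float()                          # 1 if the label rating >= t
    return F.binary_cross_entropy_with_logits(level_logits, targets, reduction="sum")

def distribution_critic_loss(attr_log_prob, rating, nu):
    """Weighted NLL for distribution control: attr_log_prob = log p_C(a | x).
    The sign and magnitude of c come from the rating polarity and strength."""
    c = (rating - nu) / (nu - 1)
    return -c * attr_log_prob
```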

Experiments and Results
In the following experiments, we demonstrate the ability of NANO to generate text to (1) follow unquantified distributions and personalize, (2) follow quantified distributions, and (3) follow a single topic or attribute.

Unquantified Distribution and Personalization

RQ1. Can NANO learn to generate following unquantified distributions such as personal preferences?
One of the goals of NANO is to learn to generate text following unquantified distributions, such as distributions capturing personal preferences. We verify this by demonstrating the model's ability to capture subtle differences between different individuals' personal preferences. We ask two human annotators of different ages and backgrounds to individually participate in human-in-the-loop training on separate models with the same topic, instructions, and model initialization. We use the following three starting prompts: "Surprisingly,", "The hotel is very awesome because", and "The restaurant is disgusting because", and ask the human annotators to rate, on a scale of 1-5, how well the model completions fit their definition of "surprising," "very awesome," and "disgusting," respectively. We run a 4-iteration human-in-the-loop training with 16 sentences in each iteration, and generate 50 sentences from each final model at the end. We combine and shuffle all 100 sentences (50 per annotator) for each prompt and ask each human annotator to rate them (on the same scale of 1-5), and report the average score of the two annotators on each set of 50 sentences in Table 3, together with their respective average ratings on the same batch of initial model generations.
The results show that (1) each annotator, on average, rates generations from their own trained model significantly higher than the initial model's generations, showing that NANO is able to learn to follow these unquantified subjective distributions, and (2) both annotators give higher average ratings to the sentences generated by the model of their own training compared to the sentences generated by the model trained by the other annotator on all 3 prompts, indicating that the model is able to capture different personal preferences: the model trained by an annotator is more likely to generate sentences that fit that annotator's own personal preferences than the model trained by the other annotator, even though both annotators are given the exact same instructions, prompts, and initial model. For example, as shown in Table 4, under the prompt "This hotel is very awesome because", the model trained by annotator 1 more frequently generates descriptions of great indoor rooms and facilities, while the model trained by annotator 2 more frequently generates descriptions of the convenience of the hotel's location. The models reflect the annotators' personal preferences for hotels, as each annotator rates the sentences generated by their own model higher than the other model's generations. These results provide evidence that human annotators reflect their personal preferences through ratings, and that the model is able to capture these preferences. More examples are shown in Table 18 in Section B.3.
In addition, we compare our method's efficiency at extracting human preferences with zero-shot prompting. For the zero-shot prompting setting, annotators are given the starting prompt and asked to write about their preferences pertinent to the prompt. The combined prompt is "<annotator prompt>\n\n <original prompt>". An example of such a combined prompt is "I prefer cheaper rooms and ease of access to the rest of the city [...]\n\n This hotel is very awesome because".
We limit the time of human interaction to a fixed time budget, and compare the results of (1) prompting only, (2) NANO only, and (3) combining prompting and NANO. As we can see, our method obtains higher accuracy under the same time budget compared to prompting alone, and combining prompting with our method improves performance even further. In summary, the above experiment demonstrates NANO's ability to generate text following unquantified distributions that capture personal preferences.

Table 3: Average ratings (on a scale of 1-5) of two annotators on 50 sentences generated by each other's human-in-the-loop-trained model with the prompts "Surprisingly,", "This hotel is very awesome because", and "This restaurant is disgusting because". After training, each annotator gives their own trained model's generations higher average ratings than the initial model's generations, and also higher ratings than the generations of the model trained by the other annotator. This shows that NANO is able to learn to generate text following unquantified distributions that reflect personal preferences. These results are statistically significant (see Section A.5 for significance test results).

Table 4: Examples of sentences generated by the models trained by the 2 annotators with the prompt "This hotel is very awesome because". Annotator 1 cares much more about indoor rooms and facilities and not as much about the location of the hotel, while annotator 2 cares much more about the location of the hotel and not as much about the rooms themselves; their respective trained models reflect these preferences in the generated text.

Model trainer: Annotator 1. Generated sentence: "The hotel is very awesome because it has great bathrooms! When I was there it was very comfortable and I liked the bathroom! I am sure I will be coming again! The bathroom was clean and even had soap..." Annotator 1 rating: 5; Annotator 2 rating: 3.

Model trainer: Annotator 2. Generated sentence: "The hotel is very awesome because it is located in a very convenient location near good food and great people. I enjoyed staying there and I recommend staying there if you are visiting Austin or else if you are in the area..." Annotator 1 rating: 3; Annotator 2 rating: 5.

Quantified Distribution

RQ2. Can NANO generate text following quantified distributions?
To control quantified distributions with NANO, we first give human annotators the target distribution. Then, at each iteration, annotators are provided with up to 40 generated sentences and asked to assign higher scores to sentences with attributes that need to occur more frequently, and lower scores otherwise. We repeat this procedure for no more than 7 iterations (accumulating fewer than 300 samples). We generate 240 sentences from the final model for human evaluation.
We use GDC (Khalifa et al., 2021), an existing distribution control approach used to generate biographies with desired distributions, as the baseline. We compare our final generation distribution with their reported results in Table 6. As shown, NANO obtains distributions much closer to the desired distribution than GDC. Furthermore, to demonstrate that NANO works on domains other than biography, we apply NANO to a distribution of randomly selected cuisines in a restaurant prompt. As shown at the bottom of Table 6, NANO is able to generate text following the desired distribution in this new domain. Hence, NANO is able to generate text following quantified distributions more closely, and is not restricted by domain. We show some examples of the sentences generated by NANO in Section B.2.

Single-Attribute Control

RQ3. Can NANO generate text for a single topic or sentiment with few-shot human-in-the-loop training more consistently than baselines?
We choose three topics, POLITICS, SPACE, and MILITARY, as well as one POSITIVE sentiment task. For each labeling phase, human annotators from Amazon Mechanical Turk are asked to label 128 generated samples, and on 2 topics (SPACE and MILITARY) they are also asked to provide 5 additional on-topic examples. We repeat the outer loop until we reach 90% labeled accuracy (2-3 iterations in all settings, so fewer than 400 labels for each setting), after which we generate a final batch and ask randomly selected human annotators to label it for accuracy measurement.

Table 6: Distributional experiments of NANO compared to initial GPT-2 generation and GDC (Khalifa et al., 2021). Pink boxes are desired percentages while orange boxes are the achieved percentages. NANO yields distributions much closer to the desired distribution compared to GDC, and NANO is not limited to the biography domain: it also works well for the cuisines distribution.
We present the results in Table 7. Under all 4 topics/attributes, NANO achieves the best accuracy. Moreover, our method achieves better fluency (measured by perplexity) and generation diversity (measured by dist-3) than the other methods that report these metrics.

RQ4. Is multi-iteration human-in-the-loop training necessary?
An alternative design choice to our multi-iteration human-in-the-loop method is to ask the annotator to label all samples in a single iteration (i.e. going through the outer loop only once). However, one of the advantages of multi-iteration training is that training data quality improves over the iterations: as the outer loop progresses, generated samples improve in accuracy, leading to more positive labels and higher-quality training data. To verify this, we repeat the first experiment and train our model with both multi-iteration and single-iteration training, using the same number of total samples labeled by the human annotators.
We show the results in Table 8. Multi-iteration training yields significantly higher accuracy when provided with the same number of labels. This demonstrates the higher sample efficiency of multi-iteration human-in-the-loop training.

RQ5. Architectural Ablations.
We ablate each component of NANO on the single-attribute control task and show the results in Table 9. We experiment with freezing the vanilla GPT-2 generative language model (i.e. no generator training), removing the critic model (thus removing key-value updates from backpropagation), and removing the complementary loss from the loss function. We train for 3 iterations and ask human annotators to label 8 sentences for each iteration on the topic POLITICS, and then ask human annotators to label each sentence generated by each trained model on whether they think the sentence is related to POLITICS or not. As we can see from Table 9, removing any component significantly decreases performance; thus every component of NANO is necessary to achieve the best performance.

Table 7: Evaluation of controlled topic and sentiment generations (accuracy (%) / perplexity / dist-3). QUARK (Lu et al., 2022): 95.0 / 14.5 / 0.84; GDC (Khalifa et al., 2021): 56.0 / 20.0 / -; Ziegler (Ziegler et al., 2019): 88.0 / - / -; CoCon (Chan et al., 2021): 98.9 / 50.3 / 0.80; NANO (Ours): 99.6 / 12.7 / 0.90. NANO achieves much higher accuracy on single-topic and sentiment generations, and better fluency & diversity on sentiment generation, compared to the other methods that reported these metrics.

Extension to Larger Language Models
Recent developments in language modeling have produced larger models than GPT-2 Medium (355M) (Zhang et al., 2022; Brown et al., 2020). As a proof of concept, we demonstrate the applicability of our method to newer and larger models by running NANO on OPT-1.3B (Zhang et al., 2022) to achieve single-attribute control. Table 10 shows the performance of NANO on OPT-1.3B with 3 HITL iterations per attribute and 8 human-annotated labels per iteration. The results show that NANO is able to control OPT-1.3B to generate on-topic sentences with high accuracy compared to the vanilla model.

Table 10: Proof-of-concept results of running NANO on a larger model, OPT-1.3B (Zhang et al., 2022). Applying NANO improves accuracy on each topic compared to the vanilla OPT-1.3B model. For each topic, we train for 3 iterations with 8 labels in each iteration.

Conclusion
In this work, we introduce NANO, an algorithm that enables distribution control for text generation via few-shot human-in-the-loop training. We show that NANO achieves better distribution control compared to previous works on both single-topic and quantified distributions with simple feedback from the human trainer, and demonstrate the ability of NANO to efficiently fit its generation to unquantified distributions and personal preferences.
Limitations: Despite these successes, our current work is still limited in the following ways, which we leave to future work:
• Our current model is based on pretrained GPT-2 (Radford et al., 2019), and therefore its generation ability is limited to that of GPT-2. In the future, we would like to explore our method on newer and larger language models.
• Human labels are currently provided at the sentence level, either as a rating of a whole sentence or as a new sample sentence. However, we have observed that when generating 50-token sentences, GPT-2 will often generate one part of the sentence that follows the desired attribute/distribution while another part does not. In the future, it may be desirable to explore finer-grained human feedback, such as rating or rewriting part of a sentence.
• Our experiments are performed with low quantities of data to demonstrate that our method works in a few-shot setting. Therefore, we do not have evidence on how well our method's performance scales when a large number of annotations is available. In the future, we may explore the behavior of our model in non-few-shot settings.

A Training Details, Settings and Hyperparameters
A.1 Training Hyperparameters

A.2.1 MTurk Experiments

Figures 3 and 4 show an example of the interface and instructions provided to workers for the large-scale experiments on MTurk. We require that workers be located in an English-speaking country, be qualified as Master Workers, have an approval rate ≥ 90%, and have at least 50 approved tasks. We select our workers based on their performance on known example labels. All workers are paid at an estimated hourly rate of $9.6/hr ($0.02 per label), and the total compensation is $79.98.

A.2.2 Non-MTurk Experiments
The distribution and personalization experiments are conducted offline. We give human annotators the same instructions as outlined in the experiments and perform all iterations of training. Figure 5 shows the interface used by the human annotators for these experiments.

A.3 Consent and Content Safety
All participants consent to the research. We do not use the collected data for purposes beyond this research. Data collected in the above experiments are manually checked for personally identifiable information and offensive content. No such content is encountered in our experiments.

A.4 Model Size and Computational Resources
Our model has 710M parameters in total (355M parameters each for the generator and the critic). We use one NVIDIA GeForce RTX 2080 Ti GPU and one NVIDIA GeForce RTX 3080 Ti GPU for our training and generation processes, using only one GPU at a time. Our experiments consume an estimated total of 10 hours of GPU usage.

A.5 Statistical significance for personalization experiment
We performed unpaired t-tests on the ratings of each annotator between sentences generated by the different models. We show the p-values in Table 12. We found that all comparisons were statistically significant except one.
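For reference, an unpaired t-test of this kind can be run with SciPy as in the minimal example below; the rating lists are placeholders, not the actual experimental data.

```python
# Placeholder example of the unpaired t-test used for the RQ1 significance check.
from scipy import stats

own_model_ratings = [5, 4, 5, 3, 4, 5]     # an annotator's ratings of their own model's outputs
other_model_ratings = [3, 2, 4, 3, 3, 2]   # the same annotator's ratings of the other model's outputs

t_stat, p_value = stats.ttest_ind(own_model_ratings, other_model_ratings)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```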

B Examples
B.1 Topic/Attribute Generation

Table 13 shows examples of NANO generations on several topics. Table 14 shows examples of NANO generations for the positive sentiment.

B.3 Personalization
Table 18 shows some examples of personalization with NANO. Specifically, these examples are generated by a model trained by one annotator and are rated highly by that trainer but not as highly by the other annotator. In the hotel case, annotator 1 clearly cares much more about indoor rooms and facilities and not as much about the location of the hotel, while annotator 2 cares much more about the location of the hotel and not as much about the rooms themselves. In the surprising case, annotator 1 is clearly much more surprised by political controversy, while annotator 2 is more likely to be surprised by odd tech design choices.

Table 18: Examples of sentences generated by the models trained by the 2 annotators with the prompts "This hotel is very awesome because" and "Surprisingly,".

Model trained by Annotator 1:
"The hotel is very awesome because it has nice rooms! It has hot and small comfy bed options. We enjoyed staying at the hotel! The rooms are very comfy!..." (rating: 4)
"The hotel is very awesome because it has great bathrooms! When I was there it was very comfortable and I liked the bathroom! I am sure I will be coming again! The bathroom was clean and even had soap..." (rating: 3)

Model trained by Annotator 2:
"The hotel is very awesome because it is located in a great neighborhood accessible to the rest of the city. If you are in the market, this hotel is a great place to stay. I would recommend the restaurant at The Rock, so if you..." (rating: 5)
"The hotel is very awesome because it is located in a very convenient location near good food and great people. I enjoyed staying there and I recommend staying there if you are visiting Austin or else if you are in the area..." (rating: 5)

Model trained by Annotator 1:
"Surprisingly, however, not all nations have adopted or expanded military ties with the United States. While most European countries have accepted U.S. membership on democratic grounds, including the right to establish and defend an international military presence, many..." (rating: 1)
"Surprisingly, the Government has said it is unlikely to pursue charges against Snowden, paving the way for him to be granted asylum in the US. He will be able to appeal his decision to be granted asylum, meaning he could remain in..." (rating: 1)

Model trained by Annotator 2:
"Surprisingly, with the inclusion of iOS 11 in iOS 8, it seems that Apple decided to remove the ability to turn off the audio in the Settings app. As a result, to turn on the app you have to go to Settings General..." (rating: 5)
"Surprisingly, these maps only appear on my phone, while most of the other major platforms don't have maps at all. What's going on? Why is Google hiding these maps in the first place?..."

B.4 Other applications
Table 19 includes some examples of NANO trained to generate occupation-related text without gender bias.
Fairness: Reducing occupational gender bias.

Before:
"A man worked as a charter bus driver in La Haya, together with garbage-shopper Jaime Roux, before becoming an autonomous car driver, one of those who have enrolments through crowdfunding sites ZaPay and Orbot, Bota..."
"A man worked as a woodworker for years when natural forces finally undermined his knowledge and left him with nothing more than a fascination with some of his potential customers' photographs. A young collector, who remembers him only as "Mr. Mr.," sprayed..."

After:
"A man worked as a au pair at a Fort-de-France elementary school before joining the Marines. Now he's astonished to find out his partner was planning to leave the Marines as well.\nOn Sunday, a Fort de France elementary..."
"A man worked as a dishwasher at Elizabeth Oneida's Recreation Area on the Sussex County line of farms before moving to Fort Washington, Darlington County Clerk Mary Flowers said Monday.\nDespite the 34-year-old's short résum..."

D Discussion of Potential Negative Social Impact
Because NANO is trained purely from human feedback on top of a pretrained language model, it could generate text that exhibits negative properties (such as unfairness, social bias, or inappropriate language) if the human trainer intentionally or unintentionally exhibits them in their feedback during training.
Because NANO can be trained to follow an arbitrary desired distribution of text from human feedback, it can be trained to generate text following fairer distributions as well as more unfair ones. Because NANO can also be trained to follow the personal preferences of the trainer, it will generate text exhibiting any social bias or inappropriate language that the trainer shows preference for during training.
In addition, there is a risk of breached privacy: if a user trains a model using our method and releases it to others, the model may remember and exhibit the personal preferences of the trainer in its generations.
We urge practitioners of our method to read and understand the above risks and use our model responsibly to prevent these negative social impacts.
Figure 1: Overview of NANO, a controllable text generation algorithm with two nested loops. The outer loop of our algorithm is a cycle of three learning phases: (1) generation, (2) human feedback, and (3) training. These allow the generation quality to improve over time.

Table 2 :
Comparison of NANO with related work. NANO is able to work with an arbitrary target distribution of text, regardless of domain or quantifiability, with no need for an existing dataset following the distribution or for external classifiers, and only requires human annotators to annotate a limited number of sentences.
Figure 2: Overview of the generation loop, i.e. the inner loop of our algorithm. It consists of a tree search and state updates to guide the model towards generating more accurate results.
Algorithm 1: Controlled generation. L: language model. H: intermediate key-values from the language model. C: critic network. ℓ: length of generation. k: gradient descent steps. η: gradient descent step size. d: fluency threshold. x: tokens generated so far, including the prompt.

Table 8 :
Results of the ablation on single-iteration human-in-the-loop training versus multi-iteration human-in-the-loop training, with the same number of total human-labeled samples under both settings for each topic/attribute. Multi-iteration human-in-the-loop training yields significantly higher accuracy.

Table 9 :
Ablations on each component of NANO. We provide the average decrease in accuracy after removing each component, compared to our full model, under the same few-shot setting on the topic POLITICS (3 iterations, 8 sentences each).

Table 12 :
The p-values of the RQ1 experiment. The results were clearly statistically significant (i.e. p ≤ 0.05) in all but one comparison.

Table 17 :
Samples generated by NANO following the Cuisines distributions. The prompt part is underlined.

"...Minnesotan comfort and deliciousness through our menu of family-style comfort foods and housemade crafts. Come by or leave us a review on Yelp. We look forward to seeing you soon!\n\nAddress: 119..." [American]
"This restaurant provides traditional American comfort food made with ingredients carefully selected, including locally sourced meats, vegetables and grains from the farms of southern Iowa. Food is prepared and served slowly, with a slight hint of spice.\n\nHours are Mon-Fri..." [American]
"This restaurant provides traditional dishes based on Japanese cooking principles, reflecting the region's rich culinary heritage. We served karaoke featuring a variety of Japanese tracks. Our soft serve menu offers a selection of taro, vegetable and seafood. We also..." [Japanese]
"This restaurant provides traditional sashimi served in a fragrant buttered and slightly sweet soup using seasonal ingredients. We feature several of these dishes including Yamamoto Salmon, Honshu Pork, and Tempura...\n\nContact us for more information..." [Japanese]
"This restaurant provides traditional southern Mexican dishes inspired by cuisines of Southern Mexico including agua frescas, yurts, cervessees and tortillas.\n\nContact us for suggestions or general questions.\n\nTibetWatch..." [Mexican]
"This restaurant provides traditional family style Mexican cuisine with a modern twist. Situated just outside of downtown El Paso on La Brea, Taco Bell ® has become one of the nation's most popular small business lunch and dinner establishments with more than 800 locations..." [Mexican]
"This restaurant provides traditional Vietnamese food and specialties at an affordable price! Located right across from the intersection of Clark and Lassen streets, Stop by for a coffee, lunch or dinner in comfort, or grab a glass of cold Vietnamese beer for..." [Vietnamese]
"This restaurant provides traditional Vietnamese food, with beautiful location across from the University of Texas and nearby downtown Austin. Our famous food - fresh rolls, fresh fish, fresh seafood and desserts - is what make us special. Come experience the Vietnamese culture fresh and..." [Vietnamese]

Table 19 :
Samples generated by NANO following other distributions. The prompt part is underlined.