Language Models are Few-Shot Butlers

Pretrained language models demonstrate strong performance in most NLP tasks when fine-tuned on small task-specific datasets. Hence, these autoregressive models constitute ideal agents to operate in text-based environments where language understanding and generative capabilities are essential. Nonetheless, collecting expert demonstrations in such environments is a time-consuming endeavour. We introduce a two-stage procedure to learn from a small set of demonstrations and further improve by interacting with an environment. We show that language models fine-tuned with only 1.2% of the expert demonstrations and a simple reinforcement learning algorithm achieve a 51% absolute improvement in success rate over existing methods in the ALFWorld environment.


Introduction
Over the past few years, successive generations of language models (Radford et al., 2018, 2019; Brown et al., 2020) have reshaped the way we approach Natural Language Processing problems. These Transformer-based (Vaswani et al., 2017) networks scale to ever-increasing amounts of parameters, data and compute (Kaplan et al., 2020) while demonstrating impressive transfer-learning capabilities across a wide variety of benchmarks (Wang et al., 2018, 2019; Rajpurkar et al., 2018; Reddy et al., 2019).
However, the application of modern language models to action generation in text-based environments (Côté et al., 2018; Hausknecht et al., 2019; Shridhar et al., 2021) remains largely unexplored (Yao et al., 2020). Intuitively, large-scale models leveraging strong linguistic priors should thrive in settings where observations and actions are textual (Luketina et al., 2019), but several issues arise. Indeed, agents must conform to an environment's generative grammar and gameplay specificities, making off-the-shelf transfer fail in most games. A way to overcome this problem is to acquire expert demonstrations and resort to the widely used paradigm of fine-tuning on task-specific data (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019). Nevertheless, collecting demonstrations in text-based environments requires far more time and expert knowledge than for most NLP tasks. A single demonstration includes tens of actions taken over a long time horizon to solve multiple subgoals.

Figure 1:
Goal: Rinse the egg to put it in the microwave.
Obs: Looking quickly around you, you see a cabinet, a garbagecan, a coffeemachine, [...], a stoveburner, a sinkbasin and a microwave.
Action: go to sinkbasin
Obs: You arrive at sinkbasin. You see a butterknife, a potato, a spoon and a tomato.
Action: go to garbagecan
Obs: You arrive at garbagecan. You see an egg.
Action: take egg from garbagecan
Obs: You pick up the egg from the garbagecan.
Action: go to sinkbasin
Obs: You arrive at sinkbasin. You see a butterknife, a potato, a spoon and a tomato.
Action: clean egg with sinkbasin
Obs: You clean the egg using the sinkbasin.
Action: go to microwave
Obs: You arrive at microwave. The microwave is closed.
Action: open microwave
Obs: You open the microwave.
Action: put egg in/on microwave
In this work, we propose a two-stage procedure to address these issues and develop language models acting as agents in text-based environments.
First, we train language models to imitate a few dozen expert demonstrations in order to respect an environment's grammar and acquire basic game sense. Second, we let the models interact with the environment and iteratively treat successful trajectories as additional expert demonstrations for further fine-tuning. We demonstrate the effectiveness of our approach in the recently introduced ALFWorld environment (Shridhar et al., 2021)¹, which was designed with an extensive set of tasks and expert demonstrations.
In summary, our contributions are the following:
1. We show that language models fine-tuned on thousands of expert demonstrations considerably outperform current methods in the ALFWorld environment.
2. We achieve strong results with a fraction of the demonstrations by combining imitation and reinforcement learning algorithms.
3. We illustrate the robustness of the models developed to human-annotated goals in realistic scenarios.

Background: goal-based textual environments
A goal-based textual environment can be represented as a partially observable Markov decision process $P = (S, O, A, G, R, T, M)$ where observations, actions and goals are specified in natural language. In state $s_t \in S$, an agent takes action $a_t \in A$ conditioned on context $c_t = (g, o_0, a_0, \ldots, o_t)$. It receives reward $r_t = R(s_t, a_t, g)$, which is an indicator variable for the completion of goal $g \in G$, and a new observation $o_{t+1} = M(T(s_t, a_t))$, where $M : S \to O$ is a mapping from states to observations and $T : S \times A \to S$ is the transition function.

Learning from demonstrations
A demonstration $d$ consists of a sequence of observations and actions $(o_0, a_0, o_1, a_1, \ldots, o_T, a_T)$ for reaching goal $g$ based on contexts $(c_0, c_1, \ldots, c_T)$.
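We consider a dataset $D$ of $N$ demonstrations. A parameterized model $p_\theta$ is trained to minimize the mean demonstration loss, where the loss of a demonstration aggregates the losses of its actions. As noted by Yao et al. (2020), the loss of an action is computed over its tokens, where $a^j$ is the $j$-th token generated in action $a$ of length $m$.

¹ ALFWorld aligns both text and embodied environments, but here we only refer to the text environment.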
We use a per-demonstration loss instead of a per-action loss to reduce computational costs. Indeed, with this formulation a Transformer-based autoregressive model can leverage previous computations when considering a new context from the same demonstration. In addition, early experiments suggested that a per-demonstration loss does not harm performance.
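The objective described in this section could be written as follows. This is only a sketch under stated assumptions: the symbols $\mathcal{L}$ and $\mathcal{L}_a$ are introduced here for illustration, the per-action loss is assumed to be the standard token-level negative log-likelihood, and no length normalization is assumed.

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{d \in D} \sum_{t=0}^{T} \mathcal{L}_a(a_t \mid c_t),
\qquad
\mathcal{L}_a(a \mid c) = -\sum_{j=1}^{m} \log p_\theta\left(a^j \mid a^{<j}, c\right)
```

Under this reading, the inner sum over $t$ is the per-demonstration loss, and a single forward pass over a demonstration yields every term $\mathcal{L}_a(a_t \mid c_t)$ at once.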
We call action modeling the process of minimizing the mean demonstration loss, which is conceptually very similar to language modeling, except that we only maximize the likelihood of action tokens instead of maximizing the likelihood of the full trajectory $c_T$.
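The difference between action modeling and plain language modeling can be illustrated with a small sketch. It assumes per-token log-probabilities are already available for a flattened demonstration, and that a boolean mask marks which tokens belong to actions; the function names and the toy numbers are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def action_modeling_loss(
    token_log_probs: Sequence[float],
    is_action_token: Sequence[bool],
) -> float:
    """Negative log-likelihood summed over action tokens only.

    token_log_probs[i] is log p(token_i | tokens before i) for one flattened
    demonstration; is_action_token[i] is True when token_i is part of an action.
    """
    if len(token_log_probs) != len(is_action_token):
        raise ValueError("log-probabilities and mask must have the same length")
    return -sum(lp for lp, keep in zip(token_log_probs, is_action_token) if keep)


def language_modeling_loss(token_log_probs: Sequence[float]) -> float:
    """Negative log-likelihood over every token (goal, observations, actions)."""
    return -sum(token_log_probs)


# Toy demonstration of four tokens: observation, action, observation, action.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
mask = [False, True, False, True]

action_loss = action_modeling_loss(log_probs, mask)  # only tokens 1 and 3 count
full_loss = language_modeling_loss(log_probs)        # all four tokens count
```

Observation tokens still appear in the context the model conditions on; they are simply excluded from the sum being minimized.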

Learning from interactions
While action modeling is a powerful training objective, a model learning from demonstrations is ultimately limited by the size of the training set. To circumvent this issue, we propose an iterated action modeling (IAM) algorithm:
1. A language model pretrained on expert demonstrations is tasked to solve a batch of goals in the environment.
2. The language model is further fine-tuned with action modeling on successful trajectories.
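The two steps above can be sketched as a simple loop. This is a minimal illustration and not the authors' implementation: `fine_tune` and `rollout` are hypothetical callables standing in for action-modeling updates and environment interaction, and `ToyAgent` is a stand-in whose "skill" merely counts the trajectories it has been trained on.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A trajectory is a goal plus the (observation, action) steps taken to solve it.
Trajectory = Tuple[str, List[Tuple[str, str]]]


def iterated_action_modeling(
    fine_tune: Callable[[Sequence[Trajectory]], None],
    rollout: Callable[[str], Tuple[Trajectory, bool]],
    demonstrations: Sequence[Trajectory],
    goal_batches: Sequence[Sequence[str]],
) -> List[int]:
    """Run imitation followed by IAM; return the number of successes per batch."""
    # Imitation stage: action modeling on the small set of expert demonstrations.
    fine_tune(demonstrations)
    successes_per_batch = []
    for goals in goal_batches:
        # Step 1: attempt a batch of goals in the environment.
        attempts = [rollout(goal) for goal in goals]
        successful = [trajectory for trajectory, solved in attempts if solved]
        # Step 2: treat successful trajectories as additional demonstrations.
        if successful:
            fine_tune(successful)
        successes_per_batch.append(len(successful))
    return successes_per_batch


@dataclass
class ToyAgent:
    """Stand-in for a language model: skill grows with each trajectory seen."""

    skill: int = 0

    def fine_tune(self, trajectories: Sequence[Trajectory]) -> None:
        self.skill += len(trajectories)

    def rollout(self, goal: str) -> Tuple[Trajectory, bool]:
        # Toy rule: a goal named "level-k" is solved when skill >= k.
        difficulty = int(goal.split("-")[1])
        trajectory = (goal, [("obs", "action")])
        return trajectory, self.skill >= difficulty


agent = ToyAgent()
demos = [("level-1", [("obs", "action")]), ("level-2", [("obs", "action")])]
batches = [["level-1", "level-2", "level-3", "level-4", "level-5"]] * 3
history = iterated_action_modeling(agent.fine_tune, agent.rollout, demos, batches)
```

In the toy run the agent solves more goals at each iteration because every success feeds the next round of fine-tuning, which is the bootstrapping effect the algorithm relies on. Starting from zero demonstrations would leave the toy agent with no successes to learn from.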
The key advantage of this algorithm is that we can easily combine imitation learning with reinforcement learning since we are optimizing the same objective over two distinct sources of data: demonstrations and successful attempts. Moreover, the extensively pretrained language/action modeling head is kept during reinforcement learning instead of initializing a new RL-specific head from scratch, which was shown to lead to better performance in NLP tasks (Gao et al., 2021).

Table 1: Success percentages per evaluation split (in-distribution and out-of-distribution) with and without human-annotated goals. GPT_partial and GPT are GPT2-based models fine-tuned with action modeling on 42 and 3553 demonstrations, respectively. GPT*_partial corresponds to the former model subsequently trained with iterated action modeling in ALFWorld. Our results are averaged over 5 seeds. Standard deviations are upper bounded by 9 for GPT_partial, 8 for GPT*_partial, and 3 for GPT.

Experiments
Experiments were implemented with the Transformers (Wolf et al., 2020) library.

Environment and dataset
ALFWorld (Shridhar et al., 2021) is a goal-based textual environment mirroring the embodied ALFRED benchmark (Shridhar et al., 2020) with the TextWorld game engine (Côté et al., 2018). The environment was created with the aim of learning high-level language policies inside of it and transferring them to the embodied setting. ALFWorld includes 6 tasks that are compositional and require multiple sub-goals to be solved over various time horizons. Any string of words constitutes a valid action, making the action space unbounded and the training of a policy consequently difficult.
In total there are 3553 training task instances {task type, object, receptacle, room}, 140 in-distribution evaluation task instances (seen split) and 134 out-of-distribution evaluation task instances (unseen split). A task instance specifies the type of the task to solve, the object to interact with, the receptacle where the object should be put and the room layout (e.g. {heat and place, egg, countertop, kitchen 12}). In addition, each training task instance in ALFWorld comes with an expert demonstration, enabling the development of imitation learning agents.

Training
We train two GPT2-medium (345M parameters) (Radford et al., 2019) models with action modeling on the set of demonstrations. The first model, GPT, has access to the full set of demonstrations while the second model, GPT_partial, only has access to 42 demonstrations. GPT_partial is subsequently trained with iterated action modeling in the environment and is then denoted as GPT*_partial. When interacting with the environment, models greedily decode actions token by token until an end-of-action token is reached. See Appendix A for training details.
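Greedy, token-by-token action decoding can be sketched as follows. The example is illustrative only: the `<eoa>` marker, the `next_token_scores` callable and the toy scoring function are hypothetical stand-ins for a real model and tokenizer, and the `max_tokens` cap is an added safeguard not mentioned in the text.

```python
from typing import Callable, Dict, List

END_OF_ACTION = "<eoa>"  # hypothetical end-of-action token


def greedy_decode_action(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    context: List[str],
    max_tokens: int = 20,
) -> List[str]:
    """Pick the highest-scoring token at each step until END_OF_ACTION."""
    action: List[str] = []
    for _ in range(max_tokens):
        scores = next_token_scores(context + action)
        token = max(scores, key=scores.get)
        if token == END_OF_ACTION:
            break
        action.append(token)
    return action


def toy_scores(tokens: List[str]) -> Dict[str, float]:
    """Toy 'model' that always continues towards 'go to microwave'."""
    target = ["go", "to", "microwave", END_OF_ACTION]
    # Number of action tokens generated so far, counted after the marker.
    n_generated = len(tokens) - tokens.index("Action:") - 1
    scores = {token: 0.0 for token in target}
    scores[target[min(n_generated, len(target) - 1)]] = 1.0
    return scores


context = ["Goal:", "heat", "egg", "Obs:", "You", "see", "a", "microwave.", "Action:"]
action = greedy_decode_action(toy_scores, context)
```

Because the action space is any string of words, the end-of-action token is what turns an open-ended generation into a discrete command that can be sent to the environment.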

Evaluation
We select model checkpoints according to their evaluation performance on the seen split and further evaluate them on the unseen split. During evaluation, we employ greedy action decoding and a sliding context window which depends on the maximum number of tokens the language models can handle. This implies that the contexts given to the models consist of the goal, the first observation and as many of the previous observations and actions as possible. We compare our models with the ones developed by Shridhar et al. (2021):
• BUTLER: trained with DAgger (Ross et al., 2011) for 50k episodes and handling failed actions with beam search.
• Seq2Seq: trained with the full set of demonstrations.
Contrary to our approach, these models do not encapsulate prior linguistic knowledge except for pretrained word embeddings.
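The sliding context window used during evaluation can be sketched as below. This is a simplified illustration under assumptions: tokens are counted by whitespace splitting rather than with the model's tokenizer, the `Obs:`, `Goal:` and `Action:` markers are hypothetical formatting, and `build_context` is a name introduced here.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Whitespace token count; a real tokenizer would be used in practice."""
    return len(text.split())


def build_context(
    goal: str,
    first_observation: str,
    steps: List[Tuple[str, str]],
    max_tokens: int,
) -> str:
    """Keep the first observation and goal, then the most recent steps that fit.

    steps holds (action, resulting observation) pairs in chronological order.
    """
    head = f"Obs: {first_observation} Goal: {goal}"
    budget = max_tokens - count_tokens(head)
    kept: List[str] = []
    # Walk backwards so the most recent steps are the first to be kept.
    for action, observation in reversed(steps):
        step_text = f"Action: {action} Obs: {observation}"
        cost = count_tokens(step_text)
        if cost > budget:
            break  # stop at the first step that no longer fits
        kept.append(step_text)
        budget -= cost
    return " ".join([head] + kept[::-1])


steps = [
    ("go to fridge", "You arrive at fridge."),
    ("open fridge", "You open the fridge."),
    ("take egg from fridge", "You pick up the egg."),
]
context = build_context("heat egg", "You see a microwave.", steps, max_tokens=30)
```

With a budget of 30 whitespace tokens, the oldest step is dropped while the goal, the first observation and the two most recent steps are kept, so the model never loses sight of what it was asked to do.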

Robustness
In ALFWorld, goals follow a generative grammar specific to the environment, e.g. "put a hot apple in fridge". However, when interacting with autonomous agents, humans may formulate goals that deviate from this grammar, e.g. "warm up apple to put in fridge". The ability to generalize to human-annotated goals is quantitatively assessed with crowd-sourced goal annotations (Shridhar et al., 2020, 2021). We evaluate the best performing models from Section 3.3 on the human-annotated seen and unseen splits.

Language models strongly outperform existing methods in ALFWorld
We report the entirety of the results in Table 1. GPT achieves success rates of 91% and 95%, respectively, on the seen and unseen splits. These are absolute improvements of 81% and 86% over the Seq2Seq model trained on the same data. Even when compared to BUTLER, trained with 14 times more expert-guided demonstrations and manually handling failed actions, we observe absolute improvements of 51% and 58%. GPT_partial is also competitive with BUTLER and outperforms the Seq2Seq model with only 0.07% and 1.2% of the expert demonstrations available. However, there remains a large performance gap between the two GPT2-based models.

Iterated action modeling retains most of the performance with few demonstrations
With iterated action modeling, GPT_partial's performance improves by 22% and 20%, respectively, on the seen and unseen splits. In other words, the resulting model, GPT*_partial, retains 76% and 63% of GPT's results with only 1.2% of the expert demonstrations available.

Agents with linguistic priors are robust to human-annotated goals
Evaluation on the seen and unseen splits with human-annotated goals reveals that language models fine-tuned with action modeling on expert demonstrations and successful trajectories are capable of solving a large proportion of goals formulated in open-ended natural language. For example, GPT and GPT*_partial solve respectively 57% and 37% of human-annotated, out-of-distribution task instances. Figure 1 illustrates GPT*_partial solving one of these tasks.

Related work
Yao et al. (2020) used language models to prune the action space in text-based games. The authors introduced the ClubFloyd dataset, which contains gameplay transcripts collected over a multitude of games, and fine-tuned a GPT2-small (117M parameters) (Radford et al., 2019) model on that dataset for action generation. This contextual action language model (CALM) was then queried to generate a small list of action candidates based on the last few observations and actions. CALM was combined with game-specific models trained with reinforcement learning (He et al., 2016) to pick the best action candidate among CALM's generations. This approach aims to transfer a general-purpose language model across multiple new environments without game-specific imitation or reinforcement learning. In our work, we optimize for performance instead of generalization by training language models with game-specific demonstrations and interactions. In fact, preliminary experiments with CALM in ALFWorld reveal that the model is unable to produce valid actions both in terms of grammar and task completion.
Goal-conditioned supervised learning (Ghosh et al., 2021) treats every trajectory as an expert demonstration for reaching the final state encountered in that same trajectory. This hindsight goal relabeling is possible because there exists a straightforward mapping between goals and states in the environments considered (i.e. the identity map). In ALFWorld, learning such a mapping is highly nontrivial and constitutes another research direction for extending existing methods (Cideron et al., 2019) to this environment. Therefore, during iterated action modeling we only consider successful trajectories as expert demonstrations and initialize the agent with a few demonstrations in order to start the RL procedure with a non-zero success rate.

Conclusion
We developed new agents for text-based environments with pretrained language models. These agents acquired game knowledge through demonstrations and interactions to drastically outperform current methods in the ALFWorld environment. While we investigated learning under the standard fine-tuning paradigm, more sophisticated approaches could be explored (Schick and Schütze, 2020) and recent works (Brown et al., 2020; Zhao et al., 2021) even suggest that scaled-up and carefully calibrated models achieve great downstream results without requiring any parameter updates. Thus, in the near future one can imagine language models solving text-based environments with only a few demonstrations for priming.

A Hyperparameters and sample selection
We do not leverage any (potentially large) held-out set of demonstrations to tune hyperparameters or learning objectives. As mentioned in Section 3.3, we solely optimize the success rate over a small set of validation task instances that we can freely query rather than a validation loss on held-out examples. Hyperparameters for the action modeling and iterated action modeling experiments are displayed in Table 2 and

B Performance as a function of the number of training demonstrations
In Figure 2, we provide a curve of model performance as a function of the number of training demonstrations for the action modeling stage. Around 168 demonstrations are necessary to achieve a success rate equivalent to that of GPT*_partial. In other words, adding the iterated action modeling procedure brings improvements similar to those we would get if we multiplied the number of demonstrations by 4.

C Input representation
In practice, a context is formed in the following way:
1. Append the goal to the first observation.