LEMON: Language-Based Environment Manipulation via Execution-Guided Pre-training

Language-based environment manipulation requires agents to manipulate an environment following natural language instructions, which is challenging due to the huge space of possible environments. Various approaches have been proposed in recent work to address this challenge. Although these approaches work well for their intended environments, they are difficult to generalize across environments. In this work, we propose LEMON, a general framework for language-based environment manipulation tasks. Specifically, we first specify a task-agnostic approach for language-based environment manipulation tasks, which can deal with various environments using the same generative language model. Then we propose an execution-guided pre-training strategy to inject prior knowledge of environments into the language model using a purely synthetic pre-training corpus. Experimental results on tasks including ALCHEMY, SCENE, TANGRAMS, PROPARA and RECIPES demonstrate the effectiveness of LEMON: it achieves new state-of-the-art results on four of the tasks, and the execution-guided pre-training strategy brings remarkable improvements on all experimental tasks.


Introduction
Building agents that can understand human language and accordingly manipulate the environment around them has been a long-standing goal of artificial intelligence (Winograd, 1971). Various tasks focus on this scenario, including collaborative building (Narayan-Chen et al., 2019), state tracking (Dalvi et al., 2018; Tandon et al., 2020) and instruction following (Andreas and Klein, 2015; Long et al., 2016; Suhr et al., 2019). What these tasks have in common is that the agents are required to manipulate the environment based on natural language. To capture the commonality of existing tasks, we define such tasks as language-based environment manipulation (LEM) tasks. Generally, these tasks are challenging due to the large exploration space of the environment itself and the complexity of human-agent interactions. For example, in the environment shown in Figure 1, the agent needs to correctly manipulate seven beakers with various colored liquids according to the long instruction.

Figure 1: In the pre-training stage, the input of LEMON includes an initial environment state and a program, and the goal environment state serves as the supervision. The fine-tuning stage is similar to the pre-training stage, except that the program in the model input is replaced by the natural language instruction.

* Work done during an internship at Microsoft Research Asia.
1 Our code is available at: https://github.com/microsoft/ContextualSP
To address these challenges, recent work has proposed various specialized models to deal with different environments (Suhr and Artzi, 2018; Dalvi et al., 2018; Gupta and Durrett, 2019b; Tang et al., 2020). Although these models work well, they are difficult to generalize across environments since they contain environment-specific modules. For example, Suhr and Artzi (2018) design different encoder modules for different environments.
Different from previous work focusing on specialized models, we argue that by formulating LEM tasks as sequence generation problems, the family of generative language models (GLMs), such as BART (Lewis et al., 2020), can serve as an environment-generic agent for various environments. Taking advantage of GLMs, such a task-agnostic solution greatly reduces the difficulty of modeling different environments. However, GLMs generally lack prior knowledge of downstream environments, since they have not seen even similar ones during pre-training. To unleash the power of GLMs in downstream environments, we argue that GLMs should be continually pre-trained to understand these environments, and that the pre-training should engage GLMs to explore as much of the environment space as possible. We believe that if GLMs can understand an environment well, they can more easily manipulate it according to human language.
Inspired by the above, in this paper, we propose LEMON (Language-based Environment Manipulation via Execution-guided Pre-training), a general framework for LEM tasks. As shown in Figure 1, LEMON consists of two parts: 1) a task-agnostic approach that uses the same protocol to tackle different LEM tasks (right); 2) an execution-guided pre-training strategy, which injects prior knowledge about environments into the GLM (left). For the first part, we employ the popular BART (Lewis et al., 2020) as the model backbone, and take five representative tasks, ALCHEMY, SCENE, TANGRAMS (Long et al., 2016), PROPARA (Dalvi et al., 2018) and RECIPES (Bosselut et al., 2018), as the testbed. The pre-training part engages our model to explore the environment space. Considering that the environment space mainly consists of the state space (i.e., valid environment states) and the action space (i.e., possible actions to manipulate the environment), we propose pre-training the model by synthesizing data involving these two spaces. Specifically, given an environment, we begin by randomly sampling its relevant initial states and programs 2. Feeding the random initial state and the random program to LEMON as input, we leverage the goal state after executing the program as supervision. Since program execution is easy to carry out in symbolic environments 3, our execution-guided pre-training is suitable for various symbolic environments. Meanwhile, since the random initial states and programs can be sampled systematically, we can readily obtain a large-scale, high-quality pre-training corpus without human labeling or data cleaning. To the best of our knowledge, LEMON is the first work to explore pre-training for language-based environment manipulation. In summary, the main contributions of our framework LEMON are three-fold:
• We propose a task-agnostic approach that can be tailored to various environments.
By formulating LEM tasks as sequence generation problems, our approach leverages a single architecture to tackle all of them.
• We propose a novel execution-guided pre-training strategy, which injects prior knowledge of environments by continually pre-training with only synthetic data.
• Experimental results on five tasks demonstrate that our task-agnostic approach is comparable or superior to previous systems, and our pre-training strategy further improves the performance by a significant margin (e.g., +4.1% on ALCHEMY). Finally, our approach achieves new state-of-the-art results on ALCHEMY, SCENE, PROPARA, and RECIPES.

LEMON Framework
We now discuss the LEMON framework (Figure 1) in more detail. Specifically, we introduce the task-agnostic approach for language-based environment manipulation (§2.1) and the execution-guided pre-training (§2.2). We leave the descriptions of LEMON instantiations for different tasks to §3.

Task-Agnostic Approach for LEM
As mentioned in §1, the existence of environment-specific modules makes it difficult for previous models to generalize across environments. To eliminate this issue, we propose a task-agnostic approach that tackles different environments uniformly.
Task Formulation An environment space consists of a state space and an action space, and a state can be further decomposed into a set of entities (e.g., beakers in ALCHEMY) and properties (e.g., colors in ALCHEMY). Generally, the goal of LEM tasks is to manipulate the environment state with natural language. Formally, given an initial environment state S 0, the goal of LEM tasks is to predict the goal environment state S based on the human language instruction I. In most cases, the LEM task is performed in an interactive manner, with a sequence of context-dependent instructions. In this setting, given an initial environment state S 0 and a sequence of natural language instructions I = (I 1 , I 2 , ..., I T ), where T is the total number of instructions in one conversation, the goal becomes predicting the goal environment state at each step, S = (S 1 , S 2 , ..., S T ).
In the following, we use the conversational formulation to illustrate.
Model Architecture By formulating LEM tasks as sequence generation problems, we can leverage BART (Lewis et al., 2020), a powerful encoder-decoder language model, to generate the goal environment state token-by-token. Formally, at the t-th step, the input to our model consists of three parts, namely the initial environment state S 0 , the history (I 1 , I 2 , ..., I t−1 ) and the current instruction I t . Following previous work (Liu et al., 2020), we directly concatenate the history and the current instruction to form I t = (I 1 , I 2 , ..., I t ), which contains all historical instructions. The final input to our model is the concatenation of S 0 and I t with a [SEP] token as a separator. The output is the corresponding goal environment state S t .
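As a concrete illustration, the input assembly at step t can be sketched as follows (a minimal sketch with our own helper name and plain-string [SEP] handling; the actual model uses BART's tokenizer and special-token machinery):

```python
def build_input(initial_state: str, instructions: list[str], t: int) -> str:
    """Concatenate the initial state S_0 with all instructions up to step t.

    `initial_state` is the textual state encoding (e.g. "1:r|2:o|...|7:r"
    for ALCHEMY); `instructions` holds I_1..I_T of one conversation.
    """
    history_and_current = " ".join(instructions[:t])  # I_1 ... I_t
    return f"{initial_state} [SEP] {history_and_current}"

# Second step of a hypothetical ALCHEMY conversation.
x = build_input(
    "1:r|2:o|3:r|4:g|5:y|6:oo|7:r",
    ["drain 1 unit from beaker 1", "pour beaker 2 into beaker 3"],
    t=2,
)
# x == "1:r|2:o|3:r|4:g|5:y|6:oo|7:r [SEP] drain 1 unit from beaker 1 pour beaker 2 into beaker 3"
```

The target sequence at step t is simply the textual encoding of the goal state S_t.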

Execution-Guided Pre-training
We propose an execution-guided pre-training strategy to explore the environment space as much as possible through synthetic data. In the following, we introduce the pre-training task and the pre-training corpus generation procedure in turn.
Pre-training Task As described in §1, to encourage the model to understand and explore the environment, LEMON adopts program execution as the pre-training task. Formally, given a randomly sampled initial environment state S 0 and a randomly sampled program A, the model is pre-trained to predict the goal environment state S, as shown in Figure 2. Such a pre-training task fulfills our expectations of both environment exploration and environment understanding, which can be explained from two perspectives. From the input perspective, the task involves all essential elements of an environment (i.e., states and actions); together with large-scale random sampling, it allows the model to fully explore the environment space. From the output perspective, the task is challenging: the model must understand the environment to predict S correctly. Meanwhile, program execution as a pre-training task is highly flexible; as shown in Figure 2, it works well for five different tasks. In the implementation of pre-training, we first concatenate S 0 and A using the same [SEP] token as a separator and then feed the concatenated sequence into LEMON. The pre-training supervision, i.e., the goal environment state S, is obtained from a task-dependent executor.
In LEM tasks, the executor is designed to interpret each task-dependent program and change the current environment state to another state accordingly. In practice, the executor can be easily implemented since the environments are symbolic.
Pre-training Corpus Generation Unlike most pre-training work, which employs web crawling to collect the pre-training corpus, we synthesize the pre-training corpus directly by randomly sampling environment states and programs. Compared to human language, high-quality environment states and programs are easier to sample since they are highly structured. As introduced above, each pre-training example contains a sampled initial environment state, a sampled executable program, and a goal environment state obtained from the executor. The pre-training corpus can then be generated by repeating this sampling process. Concretely, an initial environment state can be sampled by randomly selecting a valid value for each property defined in the corresponding environment. As for program sampling, a valid program can be generated by randomly selecting a valid function and then randomly sampling from all suitable parameters of the selected function. The valid values for each property and function will be discussed later.
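The sampling loop above can be sketched as follows. This is a toy stand-in, not the paper's actual generator: `sample_state`, `sample_program`, and `execute` are hypothetical helpers for an ALCHEMY-like environment reduced to a single DRAIN action.

```python
import random

COLORS = "rgopyb"  # red, green, orange, purple, yellow, brown

def sample_state(n_beakers=7, max_units=4):
    # Each beaker holds a string of 0..max_units color letters.
    return {i: "".join(random.choice(COLORS)
                       for _ in range(random.randint(0, max_units)))
            for i in range(1, n_beakers + 1)}

def sample_program(state):
    # Toy action space: DRAIN one unit from a random non-empty beaker.
    non_empty = [i for i, liquid in state.items() if liquid]
    return ("DRAIN", random.choice(non_empty), 1) if non_empty else None

def execute(state, program):
    # Task-dependent executor: apply the program and return the goal state.
    op, beaker, units = program
    goal = dict(state)
    if op == "DRAIN":
        goal[beaker] = goal[beaker][:-units]
    return goal

def make_example():
    s0 = sample_state()
    program = sample_program(s0)
    if program is None:
        return None  # degenerate state with nothing to drain
    return {"input": (s0, program), "target": execute(s0, program)}

corpus = [ex for ex in (make_example() for _ in range(100)) if ex]
```

Because states and programs are sampled independently and executed symbolically, the loop yields arbitrarily many labeled examples with no human annotation.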

LEMON Instantiations
To demonstrate the capabilities of LEMON, we apply our framework to five exemplary tasks, namely ALCHEMY, SCENE, TANGRAMS, PROPARA, and RECIPES. Examples of each task are shown in Figure 2, including visualizations of the initial and goal environment states, as well as a schematic representation of the program. In this section, for each task, we elaborate on the definition of the environment and the applied program to instantiate LEMON.

ALCHEMY
Environment State Definition The environment state in ALCHEMY contains seven beakers, each containing up to four units of colored chemicals. Each environment state contains three properties, including beaker IDs (from 1 to 7), liquid colors (brown, green, orange, purple, red, and yellow), and liquid amounts (from 0 to 4). Figure 2 shows an example, whose initial environment state can be represented as 1:r|2:o|3:r|4:g|5:y|6:oo|7:r in text, where different letters represent different colors. Note that if a beaker does not contain any liquid, it is represented by _, and | is the delimiter that separates the states of individual beakers, which also applies to the following tasks.
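For illustration, this textual encoding can be produced and parsed with a few lines (our own helper names; the paper specifies only the format, not an implementation):

```python
def parse_state(text: str) -> dict[int, str]:
    """Parse e.g. "1:r|2:oo|3:_" into {1: "r", 2: "oo", 3: ""}."""
    state = {}
    for field in text.split("|"):
        idx, liquid = field.split(":")
        state[int(idx)] = "" if liquid == "_" else liquid
    return state

def format_state(state: dict[int, str]) -> str:
    """Inverse of parse_state: empty beakers become "_"."""
    return "|".join(f"{i}:{liquid or '_'}"
                    for i, liquid in sorted(state.items()))

s0 = parse_state("1:r|2:o|3:r|4:g|5:y|6:oo|7:r")
assert format_state(s0) == "1:r|2:o|3:r|4:g|5:y|6:oo|7:r"  # round-trip
```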
Program Definition The action space of ALCHEMY contains three kinds of actions to manipulate the environment, namely POUR, DRAIN and MIX. We use the program proposed by Guu et al. (2017), where the functions are the same as the actions defined in the environment. The detailed program grammar is shown in Table 1.
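To make the executor concrete, a minimal sketch of these three actions might look as follows. This is our own illustration, not the paper's implementation: beaker contents are strings of color letters with the last character as the top, and we assume MIX turns mixed contents brown.

```python
def pour(state, src, dst):
    # POUR: move all liquid from beaker `src` on top of beaker `dst`.
    goal = dict(state)
    goal[dst] = goal[dst] + goal[src]
    goal[src] = ""
    return goal

def drain(state, beaker, units):
    # DRAIN: remove `units` units from the top of `beaker`.
    goal = dict(state)
    goal[beaker] = goal[beaker][:len(goal[beaker]) - units]
    return goal

def mix(state, beaker):
    # MIX: assumed semantics — mixing the contents yields brown ("b").
    goal = dict(state)
    goal[beaker] = "b" * len(goal[beaker])
    return goal

s = {1: "r", 2: "oo"}
assert pour(s, 1, 2) == {1: "", 2: "oor"}
assert drain(s, 2, 1) == {1: "r", 2: "o"}
```

Each function is pure (it returns a new state), which makes chaining several program steps during corpus generation straightforward.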

SCENE
Environment State Definition The environment state in SCENE contains ten positions, with at most one person in each position. A person is defined by a shirt color and optionally a hat color. Formally, each environment state contains three properties, including position IDs (from 1 to 10), shirt colors (brown, green, orange, purple, red, and yellow), and hat colors (the same as shirt colors). As shown in Figure 2, the example initial environment state can be represented as 1:__|2:bp|3:__|4:oy|5:__ (only five positions are shown in Figure 2 for brevity) in text, where the first character in each position represents the shirt color and the second one represents the hat color. _ indicates either an empty position or a person without a hat. Note that a hat can only appear when the position is occupied.

Table 2: Experimental results on the test sets of ALCHEMY, SCENE and TANGRAMS. Fully supervised approaches (in grey background) use annotated programs as labels, while weakly supervised approaches are given no gold program. Although the comparison is not entirely fair, we report the results of fully supervised approaches for reference. Note that our ablation w.o. pre-training is identical to fine-tuning BART on the downstream task; the same holds for Table 3 and Table 4.
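The SCENE occupancy constraint above (a hat may only appear when the position is occupied) can be checked mechanically; the sketch below, with a hypothetical helper name, validates a single two-character position code:

```python
def is_valid_position(code: str) -> bool:
    """`code` is a two-character position like "bp", "o_" or "__":
    first char = shirt color (or "_" for empty), second = hat color."""
    shirt, hat = code[0], code[1]
    if shirt == "_":       # empty position...
        return hat == "_"  # ...must carry no hat
    return True

assert is_valid_position("__")      # empty position
assert is_valid_position("o_")      # person without a hat
assert not is_valid_position("_y")  # hat with no person: invalid
```

Filtering sampled states through such a check is one way to keep the synthetic corpus restricted to valid environment states.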

TANGRAMS Environment State Definition
The environment state in TANGRAMS contains a list of up to five unique objects. Similarly, the environment state can be represented by the object indexes (from 1 to 5) and the object names (A, B, C, D, and E). For example, the initial environment state in Figure 2 can be represented as 1:A|2:B|3:C|4:D|5:E. The same object cannot appear twice in one environment state. If the number of objects is less than 5, we pad the sequence with _ to a length of 5.
Program Definition Three kinds of actions manipulate the TANGRAMS environment, namely ADD, REMOVE and SWAP. We use the program proposed by Suhr and Artzi (2018), which defines the functions INSERT and REMOVE. Similar to the two kinds of programs mentioned above, composing these two functions suffices to represent all actions defined in the environment.
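To see why INSERT and REMOVE suffice, the sketch below (our own toy implementation over a list-of-names state) expresses SWAP as a composition of the two:

```python
def remove(state, index):
    # REMOVE the object at 1-based `index`.
    return state[:index - 1] + state[index:]

def insert(state, index, obj):
    # INSERT `obj` so that it ends up at 1-based `index`.
    return state[:index - 1] + [obj] + state[index - 1:]

def swap(state, i, j):
    # SWAP decomposed into two REMOVE + INSERT pairs (assumes i < j).
    a, b = state[i - 1], state[j - 1]
    state = insert(remove(state, i), i, b)
    state = insert(remove(state, j), j, a)
    return state

assert swap(["A", "B", "C"], 1, 3) == ["C", "B", "A"]
```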

PROPARA & RECIPES
Environment State Definition The PROPARA environment describes real-world scientific processes such as photosynthesis, erosion, etc. Each environment state in PROPARA contains a set of entity participants and their corresponding locations, and the locations vary as the natural language procedural text unfolds. Unlike the three environments mentioned above, the properties of an environment state in PROPARA are not fixed, but are dynamically constructed from the natural language text. Figure 2 shows an example, where the initial state indicates that the participants water, light and carbon are located in soil, sun and cloud, respectively. The environment state in PROPARA can be naturally represented in a key-value format. For example, the initial state in Figure 2 can be represented as ent:water|light|carbon loc:soil|sun|cloud, where ent: and loc: are special tokens that indicate the boundaries of entity participants and locations, respectively.
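The key-value serialization can be sketched as follows (a hypothetical helper name; only the output format is prescribed by the paper):

```python
def format_propara_state(entities, locations):
    """Serialize parallel entity/location lists into the key-value format,
    with "ent:" and "loc:" as boundary tokens and "|" as the delimiter."""
    assert len(entities) == len(locations), "one location per entity"
    return "ent:" + "|".join(entities) + " loc:" + "|".join(locations)

s = format_propara_state(["water", "light", "carbon"],
                         ["soil", "sun", "cloud"])
assert s == "ent:water|light|carbon loc:soil|sun|cloud"
```

Because the entity list is built from the procedural text itself, the same serializer works for any process without a fixed schema.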

Program Definition
In the PROPARA environment, the procedural text describes four kinds of actions, namely CREATE, MOVE, DESTROY and NONE. In practice, we use a previously proposed program in which the functions also comprise CREATE, MOVE and DESTROY, aligned with the action space of PROPARA. As for RECIPES, the environment describes the state tracking process in the cooking domain, and the definitions of the environment states and programs are similar to those of PROPARA.

Experiments
In this section, we compare LEMON with baseline methods on the tasks discussed in §3 to demonstrate its effectiveness. Due to space limitations, we do not introduce these baselines below. ALCHEMY, SCENE and TANGRAMS are evaluated with denotation accuracy, which can be further divided into the denotation accuracy of a single instruction (Inst), of the first three instructions (3utts), and of the complete interactions (5utts).

Experimental Setup
We use BART-Large in fairseq (Ott et al., 2019) to implement LEMON. During pre-training, we synthesize 1 million pre-training examples for each experimental task. The learning rate is set to 3×10−5 in all pre-training and fine-tuning experiments. During pre-training, the maximum number of training steps is set to 10,000 for ALCHEMY, SCENE and TANGRAMS and 2,000 for PROPARA and RECIPES, while the batch size is set to around 1,000 for all tasks. During fine-tuning, the maximum number of training steps is set to 10,000 for all tasks, while the batch size is set to 64 for ALCHEMY, SCENE and TANGRAMS and 32 for PROPARA and RECIPES, respectively.

ALCHEMY & SCENE From Table 2, we can see that LEMON not only achieves new state-of-the-art performance among weakly supervised approaches, but also comes close to the performance of fully supervised approaches that leverage extra annotated programs. Moreover, the results show that the execution-guided pre-training brings significant improvements (e.g., 4.1% on ALCHEMY in the 5utts denotation accuracy), which demonstrates that our pre-training strategy provides considerable prior knowledge for LEMON.
TANGRAMS Similarly, the results on TANGRAMS in Table 2 show that our execution-guided pre-training strategy improves LEMON by 3.3% in the 5utts denotation accuracy, further illustrating the effectiveness of our approach. Nevertheless, LEMON does not perform as well as the previous state-of-the-art method (Suhr and Artzi, 2018). We suppose this is because Suhr and Artzi (2018) carefully model the historical instructions, while LEMON directly concatenates them. We leave fine-grained context modeling for future work.
PROPARA As shown in Table 3, LEMON achieves a new state-of-the-art result of 72.2%, which is 1.7% higher than the previous best-performing system REAL (Huang et al., 2021) and 1.8% higher than KOALA (Zhang et al., 2021). Note that this improvement is highly non-trivial since KOALA leverages external knowledge, which indicates that the prior knowledge LEMON learns during pre-training is more effective than external knowledge. Similarly, the execution-guided pre-training brings a 3.2% improvement, which again demonstrates that the pre-training in LEMON can significantly facilitate the interaction between natural language and environments.

RECIPES Table 4 shows the experimental results on the RECIPES task: LEMON reaches state-of-the-art performance and surpasses the previous best-performing system (Huang et al., 2021) by a large margin of 7.0%. Besides, the proposed execution-guided pre-training also brings a 2.7% improvement. These results further illustrate the effectiveness of LEMON.

Pre-training Analysis
Scaling up pre-training has a positive impact Previous work (Lewis et al., 2020) has shown that the scale of the pre-training corpus is an important factor in pre-training, and thus we analyze the effect of our pre-training scale on downstream tasks. Figure 3 shows the performance on downstream tasks with respect to the size of the pre-training corpus, obtained on the validation set of each task. As can be seen, the performance of the model generally improves as the pre-training corpus is scaled up, consistent with previous observations on pre-training (Liu et al., 2021).

Improvements do not come from data leakage
Since the pre-training corpus of LEMON contains various randomly sampled environment states, one may suspect that the improvements of LEMON are due to data leakage, i.e., that LEMON has seen some environments in the downstream validation sets. Although we have already ensured that the pre-training corpus does not contain the environment states in the validation and test sets, it is still interesting to investigate the potential impact of data leakage on LEMON. To analyze the effect, we create a validation corpus of size 40,000 for each task, which contains only the environment states in the validation set, and then merge cases from the validation corpus into the pre-training corpus at certain ratios (denoted as the overlap ratio). Figure 4 shows a box plot of the performance relative to the reported performance (vertical axis) with respect to the overlap ratio (horizontal axis). We can observe that the densest area lies near 0 on the vertical axis, indicating that the two variables are unrelated, and thus that the effectiveness of LEMON hardly relies on the overlap between the pre-training corpus and the validation set. For the detailed downstream performance with respect to the overlap ratio, please refer to Appendix C.

[Case examples on ALCHEMY (see Figure 5 and Appendix D): without pre-training, the model misinterprets "add" (Instruction: "Empty out the first beaker, add the orange chemical to the red"), ignores part of a compound instruction (Instruction Completeness, 18.3%; Instruction: "Throw out one unit of the second beaker, pour the second beaker into first one"), or grounds the wrong beaker (Instruction: "Pour out one part of the second yellow beaker").]
Improvements come from prior knowledge acquisition To show what LEMON obtains from the execution-guided pre-training procedure, we manually analyze examples in the validation set whose predictions are wrong before pre-training and correct after pre-training. Table 5 shows the main types of improvements brought by the execution-guided pre-training. We can see that with the execution-guided pre-training, LEMON successfully masters the prior knowledge of different environments. Specifically, LEMON can manipulate the environments better, as reflected in the correctness of operations, the completeness of instructions, and the correctness of grounding.

Related Work
Language-based Environment Manipulation The first line of related work comprises previous work on LEM tasks. According to the output, existing methods on LEM tasks can be mainly divided into two categories: program prediction and state prediction. Early work typically treats the LEM task as a program prediction problem (Long et al., 2016; Guu et al., 2017; Suhr and Artzi, 2018; Fried et al., 2018; Huang et al., 2019; Yeh and Chen, 2019). However, these approaches are environment-dependent and cannot be easily adapted to other environments. Besides, they either rely on natural language-program pairs as supervision or require complex heuristic rules, which is costly. Recent approaches generally treat the LEM task as a state prediction problem by predicting the goal state directly (Dalvi et al., 2018; Du et al., 2019; Das et al., 2019; Tang et al., 2020; Rajaby Faghihi and Kordjamshidi, 2021; Zhang et al., 2021). These models eliminate the data collection issue, but require complex models designed to meet the needs of different kinds of environments. Compared with the above work, LEMON has the following advantages: 1) the proposed task-agnostic approach does not require additional annotations and is easy to generalize across different environments; 2) the proposed execution-guided pre-training strategy can further improve model performance with synthetic data only.

Program Execution
The second line of related work is execution-guided work, of which the most relevant are ProTo (Zhao et al., 2021) and TAPEX (Liu et al., 2021). ProTo learns to execute given programs on the observed task specifications, focusing on following a given program to perform the corresponding task. Different from ProTo, LEMON focuses on pre-training with program execution to enhance downstream performance. Following a similar idea, TAPEX (Liu et al., 2021) improves table pre-training by learning SQL execution over tables. The main difference between TAPEX and LEMON is that TAPEX chooses SQL execution as the pre-training task, which is suitable for a single environment only, whereas LEMON is more flexible: it enables us to systematically design the pre-training task and synthesize the pre-training corpus based on environment properties, and it is proven effective on multiple environments.

Conclusion & Future Work
In this work, we propose LEMON, a general framework for language-based environment manipulation tasks that not only models different environments using the same protocol, but also injects prior knowledge of environments into our model. Experimental results on five tasks demonstrate the effectiveness of LEMON: the execution-guided pre-training strategy brings significant improvements on all of them, and LEMON achieves state-of-the-art performance on four of them. For future work, we hope to extend our approach to more complex environments and tasks such as image editing (Fu et al., 2020) and text editing (Faltings et al., 2021).

Limitations
The main limitation of this paper is that LEMON focuses on symbolic environments instead of raw environments with only visual features. Compared to the latter, the former can be represented by semantic symbols, and thus enjoys better controllability and interpretability. We leave the exploration of raw environments for future work.

Ethics Statement
In this paper, we propose LEMON, a general framework for language-based environment manipulation tasks, consisting of a task-agnostic approach and an execution-guided pre-training strategy. We conduct experiments on five benchmarks, namely ALCHEMY, SCENE, TANGRAMS, PROPARA and RECIPES.

Table 8 and Table 9 show the grammar rules of the programs used in each domain.

C Downstream Performance w.r.t. Overlap Ratio

Table 11 shows the downstream performance on the validation sets with respect to the overlap ratio in the pre-training corpus.

D Pre-training Improvement Analysis
The main types of improvements brought by the execution-guided pre-training on the five tasks are shown in Table 12 and Table 13. Table 14 shows examples for each domain, including the initial environment state, the program, and the corresponding goal environment state.

E Example Program, Initial State and Goal State of Each Domain

F Case Study

Figure 5 shows two cases in ALCHEMY and SCENE, providing a more intuitive view of the role played by the execution-guided pre-training in LEMON. We display the initial environment states, the natural language instructions, and the goal environment states predicted with and without applying the execution-guided pre-training strategy. In the first case (a), when pouring yellow liquid from the 5-th beaker into the 3-rd beaker, the latter receives red liquid, which is clearly an inconsistent change. However, with pre-training, LEMON can predict the correct goal environment state by deeply understanding the actions conveyed in the instructions.

Table 8 and Table 9 (excerpt): program grammar rules for each domain.
ALCHEMY:
⟨pour⟩ → POUR (⟨beaker⟩, ⟨beaker⟩) — Pour the liquid from the first beaker to the second beaker.
⟨drain⟩ → DRAIN (⟨beaker⟩, ⟨integer⟩) | DRAIN (⟨beaker⟩, ⟨fraction⟩) — Pour out the ⟨integer⟩ unit from the ⟨beaker⟩ beaker; pour ⟨fraction⟩ of the liquid out of the ⟨beaker⟩ beaker.
⟨beaker⟩ → BEAKER (⟨index⟩) | BEAKER (⟨index⟩, ⟨color⟩) — The ⟨index⟩-th beaker (of ⟨color⟩ color).
⟨index⟩ → 1 | 2 | · · · | 7 | −1 | −2 | · · · | −7 — The index of a certain component in the environment.
⟨color⟩ → r | g | o | p | y | b — The symbols corresponding to the colors red, green, orange, purple, yellow and brown.
⟨integer⟩ → 1 | 2 | 3 | 4 — The units of liquid.
⟨fraction⟩ — The percentage of the liquid.
SCENE:
⟨remove⟩ — Remove the person on the ⟨index⟩-th position.
⟨hat⟩ → HAT (⟨index⟩, ⟨color⟩) — Add a hat of ⟨color⟩ color for the person on the ⟨index⟩-th position.
⟨rmhat⟩ → RMHAT (⟨index⟩) — Remove the person's hat on the ⟨index⟩-th position.
⟨index⟩ — The index of a certain component in the environment.
⟨color⟩ — The symbols corresponding to the colors red, green, orange, purple, yellow and brown.
TANGRAMS:
⟨remove⟩ — Remove the object at the ⟨index⟩ position.
⟨index⟩ → 1 | 2 | 3 | 4 | 5 — The index of a certain component in the environment.
⟨object⟩ → A | B | C | D | E — The object name.

Figure 5 case details:
ALCHEMY: The difference lies in the first beaker, orr (w.o. pre-training) versus or (w. pre-training). Without pre-training, the model seems to ignore the instruction "throw out one unit of the second beaker". Instruction: Throw out one unit of the second beaker, pour the second beaker into first one.
SCENE: The difference lies in the second and third positions, b_, yo (w.o. pre-training) versus bo, y_ (w. pre-training). The model needs to swap hats twice, but without pre-training only one swap is performed, indicating that one of the TRADE-HATS operations is ignored. Instruction: A person in a yellow shirt enters from the right, the person in yellow takes the hat from the person in blue, the person in blue retrieves the hat from the person in yellow.