A Minimal Approach for Natural Language Action Space in Text-based Games

Text-based games (TGs) are language-based interactive environments for reinforcement learning. While language models (LMs) and knowledge graphs (KGs) are commonly used for handling the large action space in TGs, it is unclear whether these techniques are necessary or overused. In this paper, we revisit the challenge of exploring the action space in TGs and propose ε-admissible exploration, a minimal approach to utilizing admissible actions during the training phase. Additionally, we present a text-based actor-critic (TAC) agent that produces textual commands for the game solely from game observations, without requiring any KG or LM. Our method, on average across 10 games from Jericho, outperforms strong baselines and state-of-the-art agents that use LMs and KGs. Our approach highlights that a much lighter model design, with a fresh perspective on utilizing the information within the environments, suffices for effective exploration of exponentially large action spaces.


Introduction
A goal-driven intelligent agent that communicates in natural language has been a long-standing goal of artificial intelligence. Text-based games (TGs) best suit this goal, since they allow the agent to read a textual description of the world and write textual commands to the world (Hausknecht et al., 2020; Côté et al., 2018). In TGs, the agent must perform natural language understanding (NLU), sequential reasoning and natural language generation (NLG) to generate a series of actions that accomplish the goal of the game, i.e. an adventure or a puzzle (Hausknecht et al., 2020). The language nature of TGs renders the environments partially observable and the action space combinatorially large, making the task challenging. Since TGs signal the player's progress through the game score, reinforcement learning (RL) naturally lends itself as a suitable framework.
Due to its language action space, an RL agent in TGs typically deals with a combinatorially large action space, motivating various design choices to account for it. As two seminal works in this space, Yao et al. (2020) trained a language model (LM) to produce admissible actions for a given textual observation, and then used a Deep Reinforcement Relevance Network under the predicted action list to estimate the Q value. As an alternative, Ammanabrolu and Hausknecht (2020) construct a knowledge graph (KG) to prune down the action space while learning the policy distribution through an actor-critic (AC) method and supervision signal from the admissible actions. Both paradigms leverage admissible actions at different stages, at the cost of imposing additional modules and increasing model complexity.
In this paper, we take a fresh perspective on leveraging the information available in the TG environment to explore the action space without relying on LMs or KGs. We propose a minimal form of utilizing the admissibility of actions to constrain the action space during training, while allowing the agent to act independently of access to the admissible actions during testing. More concretely, our proposed training strategy, ε-admissible exploration, leverages the admissible actions via random sampling during training to acquire diverse and useful data from the environment. Then, our text-based actor-critic (TAC) agent learns the policy distribution without any action space constraints. It is noteworthy that our much lighter proposal is under the same conditions as the aforementioned methods, since all the prior works use admissible actions in training the LM or the agent.
Our empirical findings, in Jericho, illustrate that TAC with ε-admissible exploration has better or on-par performance in comparison with state-of-the-art agents that use an LM or KG. Through experiments, we observed that while previous methods have their action selection largely dependent on the quality of the LM or KG, sampling admissible actions helps with the action selection and results in acquiring diverse experiences during exploration. While showing significant success on TGs, we hope our approach encourages alternative perspectives on leveraging action admissibility in other domains of application where the action space is discrete and combinatorially large.

Basic Definitions
Text-based Games. TGs are game simulation environments that take natural language commands and return textual descriptions of the world. They have received significant attention in both the NLP and RL communities in recent years. Côté et al. (2018) introduced TextWorld, a TG framework that automatically generates textual observations through a knowledge base in a game engine. It has several hyper-parameters to control the variety and difficulty of the game. Hausknecht et al. (2020) released Jericho, an open-sourced interface for human-made TGs, which has become the de-facto testbed for developments in TG.

Admissible Action. A list of natural language actions that are guaranteed to be understood by the game engine and change the environment in TGs are called Admissible Actions. The term was introduced in TextWorld, while a similar concept also exists in Jericho under a different name, valid actions. Hausknecht et al. (2020) proposed an algorithm that detects the set of admissible actions provided by the Jericho suite: it constructs a set of natural language actions from every template with detectable objects for a given observation, runs them through the game engine, and returns those actions that changed the world object tree.
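The detection algorithm described above can be sketched as follows; the engine interface here is a toy stand-in (Jericho performs this check against the Z-machine world object tree far more efficiently):

```python
# Sketch of admissible-action detection: instantiate every template with
# detectable objects, run each candidate through the engine, and keep those
# that change the world state. The engine API below is hypothetical.

def detect_admissible(templates, objects, world_state, step_fn):
    """Try every template-object instantiation; keep those that change the world."""
    admissible = []
    for template in templates:
        slots = template.count("OBJ")
        # enumerate object fillings for 0-, 1-, or 2-slot templates
        if slots == 0:
            candidates = [template]
        elif slots == 1:
            candidates = [template.replace("OBJ", o, 1) for o in objects]
        else:
            candidates = [template.replace("OBJ", o1, 1).replace("OBJ", o2, 1)
                          for o1 in objects for o2 in objects if o1 != o2]
        for action in candidates:
            if step_fn(world_state, action) != world_state:  # world tree changed
                admissible.append(action)
    return admissible

# Toy engine: only "take egg" and "west" change the state.
def toy_step(state, action):
    return state + (action,) if action in ("take egg", "west") else state

acts = detect_admissible(["west", "take OBJ"], ["egg", "fridge"], (), toy_step)
```

The quadratic enumeration over objects is why this detection is expensive in practice, a point revisited in the Conclusion.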
Template-based Action Space. Natural language actions are built with a template (T) and objects (O) in a template-based action space. Each template takes at most two objects. For instance, the template-object pair (take OBJ from OBJ, egg, fridge) produces the natural language action take egg from fridge, while (west, -, -) produces west.

Traditional Reinforcement Learning. The agent ought to find the optimal policy that maximizes the expected discounted sum of rewards, or the return, R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}. There are three traditional algorithms in RL: Q-learning (QL), policy gradient (PG) and actor-critic (AC). QL estimates the return for a given state-action pair, or the Q value, then selects the action with the highest Q value. However, this requires the action space to be countably finite. To remedy this, PG directly learns the policy distribution from the environment such that it maximizes the total return through Monte-Carlo (MC) sampling. AC combines QL and PG: it removes MC from PG and updates the parameters at each step with a Q value estimated as in QL. This eliminates the high variance of MC in exchange for a relatively small bias from QL.
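As a minimal illustration of the template-object construction and the discounted return defined above (helper names are our own):

```python
# Fill a template's OBJ slots left to right; '-' placeholders are skipped.
def build_action(template, objs):
    action = template
    for obj in objs:
        if obj != "-":
            action = action.replace("OBJ", obj, 1)
    return action

# Finite-horizon evaluation of R_t = sum_k gamma^k r_{t+k+1}.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

a1 = build_action("take OBJ from OBJ", ["egg", "fridge"])  # "take egg from fridge"
a2 = build_action("west", ["-", "-"])                      # "west"
```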

Related Work on TG Agents in RL
We provide a brief overview of widely known TG agents relevant to the work presented in this paper. We empirically compare these in Section 5.1.

Contextual Action LM (CALM)-DRRN (Yao et al., 2020) uses an LM (CALM) to produce a set of actions for a given textual observation from the TGs. It is trained to map a set of textual observations to the admissible actions through causal language modeling. Then, a Deep Reinforcement Relevance Network (DRRN) agent is trained on the action candidates from CALM. DRRN follows QL, estimating the Q value per observation-action pair. As a result, CALM removes the need for the ground truth while training DRRN.

Knowledge Graph Advantage Actor Critic (KG-A2C) (Ammanabrolu and Hausknecht, 2020) uses the AC method to sequentially sample templates and objects, and KGs for long-term memory and action pruning. Throughout the gameplay, KG-A2C organizes knowledge triples from textual observations using Stanford OpenIE (Angeli et al., 2015) to construct a KG. The KG is then used to build the state representation along with encoded game observations, and to constrain the object space to only the entities that the agent can reach within the KG, i.e. immediate neighbours. They used admissible actions in a cross-entropy supervised loss.

KG-A2C Inspired Agents. Xu et al. (2020) proposed SHA-KG, which uses stacked hierarchical attention on the KG. A graph attention network (GAT) is applied to sample sub-graphs of the KG to enrich the state representation on top of KG-A2C. Ammanabrolu et al. (2020) used techniques inspired by Question Answering (QA) with an LM to construct the KG. They introduced Q*BERT, which uses ALBERT (Lan et al., 2020) fine-tuned on a dataset specific to TGs to perform QA and extract information from textual observations of the game, i.e. "Where is my current location?". This improved the quality of the KG and, therefore, constituted a better state representation. Ryu et al. (2022) proposed an exploration technique that injects commonsense directly into action selection. They used the log-likelihood score from a commonsense transformer (Bosselut et al., 2019) to re-rank actions. Peng et al. (2021) investigated an explainable generative agent (HEX-RL) and applied hierarchical graph attention to symbolic KG-based state representations. This was to leverage the graph representation based on its significance in action selection. They also employed an intrinsic reward signal towards the expansion of the KG to motivate the agent to explore (HEX-RL-IM) (Peng et al., 2021).
All the aforementioned methods utilize admissible actions in training the LM or the agent. Our proposed method, introduced shortly (§4), uses admissible actions as action constraints during training without relying on a KG or LM.
Text-based Actor Critic (TAC)

Our agent, Text-based Actor Critic (TAC), follows the Actor-Critic method with a template-object decoder. We provide an overview of the system in Figure 1 and a detailed description below. We follow the notation introduced earlier in Section 2.

Encoder. Our design consists of text and state encoders. The text encoder is a single shared bidirectional GRU with a different initial hidden state for each input text, (o_game, o_look, o_inv, a_N). The state representation only takes encoded textual observations, while the natural language action a_N is encoded to be used by the critic (introduced shortly). The state encoder embeds game scores into a high dimensional vector and adds it to the encoded observation. This is then passed through a feed-forward neural network, mapping an instance of observation to a state representation without the history of past information.
Actor. The Actor-Critic design is used for our RL component. We describe our generative actor first. Our actor network maps from the state representation to an action representation. The action representation is then decoded by GRU-based template and object decoders (Ammanabrolu and Hausknecht, 2020). The template decoder takes the action representation and produces the template distribution and a context vector. The object decoder takes the action representation, the semi-completed natural language action and the context from the template decoder to produce the object distribution sequentially.

Critic. Similar to Haarnoja et al. (2018), we employed two types of critics for practical purposes: a state critic for the state value function and a state-action critic for the state-action value function. Both critics take the state representation as input, but the state-action critic takes the encoded natural language action as an additional input. The textual command produced by the decoder is encoded with the text encoder and passed through the state-action critic to predict the state-action value, or Q value, for the given command. A more detailed diagram of the Actor and Critic is in Appendix A. To smooth the training, we introduced a target state critic as an exponentially moving average of the state critic (Mnih et al., 2015). Also, the two state-action critics are independently updated to mitigate positive bias in the policy improvement (Fujimoto et al., 2018). We use the minimum of the two critic networks' outputs as our estimated state-action value function.
Objective Function. Our objective functions are largely divided into two groups, RL and SL. The RL objectives are reward maximization L_R, state value prediction L_V, and state-action value prediction L_Q. We overload the notation of θ: for instance, V_θ(o) signifies parameters from the encoder to the critic, and π_θ(a|o) from the encoder to the actor. Reward maximization is done as follows,

L_R = −A(o, a) log π_θ(a|o),    (2)

where A(o, a) = Q_θ(o, a) − V_θ(o) is the normalized advantage function with no gradient flow. The value prediction losses are

L_V = (r + γ V_θ̄(o′) − V_θ(o))²,    L_Q = (r + γ V_θ̄(o′) − Q_θ(o, a))²,

where o′ is the observation in the next time step and θ̄ signifies the parameters containing the target state critic, updated as a moving average with τ,

θ̄ ← τ θ + (1 − τ) θ̄.

Our SL objective updates the networks to produce valid templates and valid objects through cross-entropy losses, L_T and L_O, against the admissible set. The final loss function is constructed with λ coefficients to control for trade-offs,

L = L_R + λ_V L_V + λ_Q L_Q + λ_T L_T + λ_O L_O.

Our algorithm is akin to the vanilla A2C proposed by Ammanabrolu and Hausknecht (2020) with some changes under our observations. A detailed comparison and qualitative analysis are in Appendices B and C.
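Two of the practical details above, the moving-average target update with τ and the clipped minimum over the two state-action critics, can be sketched as follows (flat parameter lists stand in for network weights; names are illustrative):

```python
# Exponential-moving-average target update: theta_bar <- tau*theta + (1-tau)*theta_bar.
def ema_update(target_params, online_params, tau=0.005):
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

# Clipped double-Q: use the minimum of the two state-action critics' estimates
# to mitigate positive bias in the policy improvement.
def twin_q(q1, q2):
    return min(q1, q2)

new_target = ema_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
```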
ε-admissible Exploration. We use a simple exploration technique during training, which samples the next action from the admissible actions with probability threshold ε. For a given state s, define A_a(s) ⊆ A_N as the admissible subset of the set of all natural language actions. We sample an action directly from the admissible action set under a uniform distribution, a^N ~ U(A_a(s)). Formally, we uniformly sample p ∈ [0, 1] at every step, and select

a = a^N ~ U(A_a(s)) if p < ε,    a ~ π_θ(·|o) otherwise.

This collects diverse experiences from altering the world with admissible actions. We also tried a variant where ε is selected adaptively given the game score the agent has achieved. However, this variant under-performed the static ε. See Appendix I for more details and the results.
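A minimal sketch of ε-admissible exploration, assuming the environment exposes its admissible action set and the policy is represented by a sampling callable:

```python
import random

# With probability epsilon, sample uniformly from the admissible set A_a(s);
# otherwise act from the policy. Names are illustrative.
def select_action(admissible_actions, policy_sample, epsilon, rng=random):
    p = rng.random()  # p ~ U[0, 1], drawn at every step
    if p < epsilon:
        return rng.choice(list(admissible_actions))  # a ~ U(A_a(s))
    return policy_sample()  # a ~ pi_theta(a | o)

rng = random.Random(0)
a = select_action(["west", "take egg"], lambda: "open window", epsilon=1.0, rng=rng)
```

At test time this reduces to `epsilon = 0.0`, i.e. the agent follows its policy with no access to the admissible set.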

Experiments
In this section, we provide a description of our experimental details and discuss the results. We selected a wide variety of agents (introduced in Section 3) utilizing the LM or the KG: CALM-DRRN (Yao et al., 2020) and KG-A2C (Ammanabrolu and Hausknecht, 2020) as baselines, and SHA-KG (Xu et al., 2020), Q*BERT (Ammanabrolu et al., 2020), HEX-RL and HEX-RL-IM (Peng et al., 2021) as state-of-the-art (SotA).

Experimental Setup. Similar to KG-A2C, we train our agent on 32 parallel environments with 5 random seeds. We trained TAC on games of the Jericho suite for 100k steps and evaluated with 10 episodes every 500 training steps. During training, TAC uses a uniformly sampled admissible action with probability ε, while during testing it follows the policy distribution generated from the game observations. We used prioritized experience replay (PER) as our replay buffer (Schaul et al., 2016). We first fine-tune TAC on ZORK1, then apply the same hyper-parameters to all the games. The details of our hyper-parameters can be found in Appendix E. Our final score is computed as the average of 30 episodic testing game scores. Additionally, our model has a parameter size of less than 2M, allowing us to run the majority of our experiments on CPU (Intel Xeon Gold 6150 2.70 GHz).
The training time comparison and the full parameter size in ZORK1 can be found in Appendices F and G.

There are a few games on which TAC under-performs. We speculate three reasons for this: over-fitting, exploration, and catastrophic forgetting. For instance, as illustrated by the learning curves of TAC in Figure 2, on LUDICORP the agent appears to acquire more reward signals during training, but fails to achieve them during testing. We believe this is because the agent over-fits to spurious features in specific observations (Song et al., 2020), producing inadmissible actions for a given state that are admissible in other states. On the other hand, TAC on OMNIQUEST cannot achieve a game score of more than 5 in either training or testing. This is due to the lack of exploration, where the agent is stuck at certain states because the next game score is too far to reach. This, in fact, also occurs in ZORK3 and ZTUU for some random seeds, where a few seeds in ZORK3 do not achieve any game score while ZTUU achieves only 10 or 13, resulting in high variance. Finally, catastrophic forgetting (Kirkpatrick et al., 2016) is a common phenomenon in TGs (Hausknecht et al., 2020; Ammanabrolu and Hausknecht, 2020), and this is also observed in JEWEL with TAC.

Main Results
Figure 2 shows that the game scores during training and testing differ in many games. There are three interpretations for this: (i) ε-admissible exploration triggers negative rewards since it uniformly samples admissible actions. A negative reward signal often triggers termination of the game, i.e. a −10 score in ZORK1, so the episodic score during training falls below that during testing. (ii) ε-admissible exploration sends the agent to rarely or never visited states, which is commonly seen in ZTUU. This induces the agent to take useless actions that do not result in rewards, since it does not know what to do. (iii) Over-fitting, where the testing score is lower than the training score. This occurs in LUDICORP, where the agent cannot escape certain states with its policy during testing. ε-admissible exploration lets the agent escape from these states during training, and therefore achieve a higher game score.

Ablation
ε-Admissible Exploration. To understand how ε influences the agent, ablations with two ε values, 0.0 and 1.0, were conducted on five selected games. As shown in Figure 3, in the case of ε = 0.0, the agent simply cannot acquire reward signals. TAC achieves a game score of 0 in REVERB, ZORK1 and ZORK3, while it struggles to learn in DETECTIVE and PENTARI. This indicates that the absence of ε-admissible exploration results in meaningless exploration until admissible actions are reasonably learned through supervised signals. With ε = 1.0, learning becomes unstable, since this is equivalent to no exploitation during training, leaving the agent incapable of observing reward signals that are far from the initial state. Hence, a tuned ε is important to allow the agent to cover a wider range of states (exploration) while acting from its experiences (exploitation).
Supervised Signals. According to Figure 3, removing SL negatively affects the game score. This is consistent with earlier observations (Ammanabrolu and Hausknecht, 2020) reporting that KG-A2C without SL achieves no game score in ZORK1. However, as we can observe, TAC manages to retain some game score, which could reflect the positive role of ε-admissible exploration, inducing behaviour similar to SL.
From the observation that the absence of SL degrades performance, we hypothesize that SL induces a regularization effect. We ran experiments with various strengths of supervised signals by increasing λ_T and λ_O in LUDICORP and TEMPLE, in which TAC attains higher scores at training than at testing. As seen in Figure 4 (left two plots), higher λ_T and λ_O relax over-fitting, raising the score from 7.7 to 15.8 in LUDICORP and from 5.8 to 8.0 in TEMPLE. Since SL is not directly related to rewards, this supports the view that SL acts as regularization. Further experimental results on ZORK1 are in Appendix D.
To further examine the role of admissible actions in SL, we hypothesize that SL is responsible for guiding the agent when no reward signal is collected. To verify this, we excluded ε-admissible exploration and ran TAC with different λ_T and λ_O in REVERB and ZORK1, in which TAC otherwise fails to achieve any score. According to Figure 4 (right two plots), TAC with stronger SL and ε = 0.0 achieves game scores from 0 to 8.3 in REVERB, and from 0 to 18.3 in ZORK1, which suggests that SL acts as guidance. However, in the absence of ε-admissible exploration, despite the stronger supervised signals, TAC cannot match the scores obtained with ε-admissible exploration.

Admissible Action Space During Training.
To examine whether constraining the action space to admissible actions during training leads to better utilization, we ran an ablation masking templates and objects with admissible actions at training time. This leads to generating only admissible actions. Our plots in Figure 3 show a reduction in the game score in PENTARI, REVERB and ZORK1, while DETECTIVE and ZORK3 observe slight and substantial increases, respectively. We speculate that the performance decay is due to the exposure bias (Bengio et al., 2015) introduced by fully constraining the action space to admissible actions during training. This means the agent does not learn how to act when it receives observations from inadmissible actions at test phase. However, for games like ZORK3, where the agent must navigate through the game to acquire sparse rewards, this technique seems to help.
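The masking in this ablation can be sketched as setting the logits of inadmissible templates (and, analogously, objects) to −∞ before the softmax; this is an illustrative reconstruction, not the exact implementation:

```python
import math

# Renormalize a categorical distribution over templates so that only
# admissible entries receive probability mass.
def masked_softmax(logits, admissible_mask):
    masked = [l if m else float("-inf") for l, m in zip(logits, admissible_mask)]
    mx = max(masked)
    exps = [math.exp(l - mx) if l != float("-inf") else 0.0 for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])  # middle template masked out
```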

Qualitative Analysis
In this section, we show how CALM and KG-A2C restrict their action space. Table 2 shows a snippet of gameplay in ZORK1. The top three rows are the textual observations, and the bottom three rows are the actions generated by CALM, the objects extracted from the KG in KG-A2C, and the admissible actions from the environment. CALM produces 30 different actions, but still misses 10 of the 17 admissible actions. Since DRRN learns to estimate the Q value over the 30 generated actions, those missing admissible actions can never be selected, resulting in a lack of exploration. On the other hand, the KG-generated objects do not include 'sack' and 'painting', which means that KG-A2C masks these two objects out of its object space. The agent then neglects any action that includes these two objects, which also results in a lack of exploration.

Discussion
Supervised Learning Loss. Intuitively, RL teaches the agent how to complete the game, while SL teaches it how to play the game. If the agent never acquires any reward signal, learning is guided only by SL. This is equivalent to applying imitation learning, making the agent follow more probable actions, a.k.a. admissible actions in TGs. However, when the agent has reward signals to learn from, SL turns into regularization (§5.2), inducing more uniformly distributed policies. In this sense, SL can be considered a means to introduce effects similar to the entropy regularization in Ammanabrolu and Hausknecht (2020).
Exploration as Data Collection. In RL, the algorithm naturally collects and learns from data. Admissible action prediction from an LM is not yet accurate enough to replace the true admissible actions (Ammanabrolu and Riedl, 2021; Yao et al., 2020). This results in poor exploration, and the agent may never reach a particular state. On the other hand, KG-based methods (Ammanabrolu and Hausknecht, 2020; Xu et al., 2020; Peng et al., 2021) must learn admissible actions before exploring the environment meaningfully. This wastes many samples, since the agent attempts inadmissible actions, collecting experiences of unchanged states. Additionally, their action selection is largely dependent on the quality of the KG. Missing objects in the KG may provoke the same effects as the LM, potentially obstructing navigation to a particular state. In this regard, ε-admissible exploration can overcome the issue by promoting the behaviour that the agent would take after fully learning the admissible actions. Our approach can be employed under the condition that a compact list of actions is either provided by the environment or extracted by an algorithm (Hausknecht et al., 2020). Intuitively, this is similar to playing the game with a game manual, but not a ground-truth walkthrough to complete the game, which leads to collecting more meaningful data. It also collects more diverse data due to the stochasticity of exploration. Hence, TAC with ε-admissible exploration can learn how to complete the game with minimal knowledge of how to play the game.
Bias in Exploration. Our empirical results from the adaptive ε experiments in Appendix I suggest that a reasonable ε is required for both under-explored and well-explored states. This could indicate that diverse data collection is necessary regardless of how much the agent knows about the game, while the ε value should not be so high that the agent cannot exploit. Finally, our ablation shows that fully constraining the action space to admissible actions degrades performance. This could be a sign of exposure bias, a typical issue in NLG tasks (He et al., 2019; Mandya et al., 2020) that arises from the training-testing discrepancy due to teacher forcing at training (He et al., 2019). In our setting, this phenomenon could occur if the agent only learns from admissible actions at training time. Since ε-admissible exploration allows the collection of experiences of any actions (i.e., potentially inadmissible actions) with probability 1 − ε, TAC with a reasonable ε learns from high quality and unbiased data. Our observations indicate that the algorithm that learns from data and the exploration that acquires the data are equally important.

Conclusion
Text-based Games (TGs) offer a unique framework for developing RL agents for goal-driven and contextually-aware natural language generation tasks. In this paper we took a fresh approach to utilizing the information from the TG environment, in particular the admissibility of actions during the exploration phase of the RL agent. We introduced a language-based actor-critic method (TAC) with a simple ε-admissible exploration. The core of our algorithm is the utilization of admissible actions in the training phase to guide the agent's exploration towards collecting more informed experiences. Compared to state-of-the-art approaches with more complex designs, our light TAC design achieves substantially higher game scores across 10 games.
We provided insights into the role of action admissibility and supervision signals during training, and the implications at test phase for an RL agent. Our analysis showed that supervised signals towards admissible actions act as a guide in the absence of a reward signal, while serving a regularization role in the presence of such a signal. We demonstrated that a reasonable probability threshold ε is required for high quality, unbiased experience collection during the exploration phase.
Similar to CALM-DRRN (Yao et al., 2020), KG-A2C (Ammanabrolu and Hausknecht, 2020) and the KG-A2C variants (Ammanabrolu et al., 2020; Xu et al., 2020; Peng et al., 2021) that use admissible actions, our method still utilizes admissible actions. This makes TAC unsuitable for environments that do not provide an admissible action set. In the absence of admissible actions, TAC requires some prior knowledge of a compact set of more probable actions from LMs or other sources. This applies to other problems as well; for instance, applying our proposed method to language-grounded robots requires per-state action candidates that the robot can sample during training. The algorithm proposed by Hausknecht et al. (2020) extracts admissible actions by simulating thousands of actions at every step in TGs. This can be used to extract a compact set of actions in other problems, but it would not be feasible if running a simulation is computationally expensive or risky (an incorrect action in a real-world robot may result in catastrophic outcomes, such as breakdown).

Ethical Considerations
Our proposal may impact other language-based autonomous agents, such as dialogue systems or language-grounded robots. In a broader aspect, it contributes to automated decision making, which can be used by corporations and governments. When designing such systems, it is important to embed ethical principles and mitigate bias so that they are used as intended.

A Details of Actor and Critic Components
Consider an action example (take OBJ from OBJ, egg, fridge) as (template, first object, second object). The template a_T = (take OBJ from OBJ) is sampled from the template decoder and encoded to h_T with the text encoder. The object decoder takes the action representation a and the encoded semi-completed action h_T, and produces the first object a_O1 = (egg). The template a_T and the first object a_O1 are combined into a_{T,O1} = (take egg from OBJ), i.e. a_T ⊗ a_O1 = a_{T,O1}. a_{T,O1} is then encoded to the hidden state h_{T,O1} with the text encoder. Similarly, the object decoder takes a and h_{T,O1} and produces the second object a_O2 = (fridge). a_{T,O1} and a_O2 are combined into the natural language action, a_{T,O1} ⊗ a_O2 = a_N. Finally, a_N is encoded to h_a with the text encoder and input to the state-action critic to predict the Q value.
B Comparison with Vanilla A2C in Ammanabrolu and Hausknecht (2020)

Architecture. The vanilla A2C from Ammanabrolu and Hausknecht (2020) consists of an actor and a state value critic, so the state representation is used to estimate the state value and produce the policy distribution.
Our TAC uses a single shared GRU to encode the textual observations and the previous action with different initial states, signifying that the text encoder constructs a general representation of text, while the game score is embedded into a learnable high dimensional vector. However, when constructing the state representation, we only used (o_game, o_look, o_inv), under our observation that o_game carries semantic information about a_{t−1}. Additionally, we observed that the learned game score representation acts as a conditional vector (Appendix C), so the state representation is constructed as an instance of observation without historical information. Finally, we included additional modules for practical purposes: a state-action value critic (Haarnoja et al., 2018), a target state critic (Mnih et al., 2015) and two state-action critics (Fujimoto et al., 2018; Haarnoja et al., 2018).
Objective Function. Three objectives are employed in Ammanabrolu and Hausknecht (2020): reinforcement learning (RL), supervised learning (SL) and entropy regularization. Both RL and SL are also used in our objectives, with minor changes in the value function update in RL. That is, our two state-action value critics are updated independently to predict the Q value per state-action pair, and the target state critic is updated as a moving average of the state critic. A notable difference is that we excluded the entropy regularization of Ammanabrolu and Hausknecht (2020). This is because, under our ablation in Section 5.2, we observed that SL acts as regularization.
Replay Buffer. Unlike the on-policy vanilla A2C (Ammanabrolu and Hausknecht, 2020), since TAC utilizes ε-admissible exploration, it naturally sits as an off-policy algorithm. We used prioritized experience replay (PER) as our replay buffer (Schaul et al., 2016). Standard PER assigns a newly acquired experience the maximum priority. This forces the agent to prioritize not-yet-sampled experiences over others. As we use 32 parallel environments and a batch size of 64 for updates, half of the updates would be directed by newly acquired experiences, not all of which may be useful. Thus, instead, we assign newly acquired experiences their TD errors as priorities when they are added to the buffer. This risks the agent not using some experiences, but it is more efficient since we sample a useful batch of experiences.
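The modified insertion rule can be sketched as follows (proportional sampling itself is elided; class and variable names are our own):

```python
# Sketch of the replay-buffer change described above: instead of inserting new
# experiences at maximum priority (standard PER), assign them |TD error| as
# priority on insertion.

class TDPriorityBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data, self.priorities = [], []

    def add(self, experience, td_error, eps=1e-3):
        priority = abs(td_error) + eps  # priority from TD error, not max priority
        if len(self.data) >= self.capacity:
            self.data.pop(0)          # evict oldest (FIFO for simplicity)
            self.priorities.pop(0)
        self.data.append(experience)
        self.priorities.append(priority)

buf = TDPriorityBuffer(capacity=2)
buf.add("e1", td_error=0.5)
buf.add("e2", td_error=-2.0)
buf.add("e3", td_error=0.0)  # evicts e1
```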

C Qualitative Analysis
It has been repeatedly reported that including the game score when constructing the state helps in TGs (Ammanabrolu and Hausknecht, 2020; Jang et al., 2021). Here, we provide some insights into what the agent learns from the observations using a fully trained TAC. To illustrate this, we highlight the role of the game score on the action preference of TAC for the same observation in ZORK1. Observations for the different cases can be found in Table 3 and Table 5, while the policy and Q values are in Table 4 and Table 6.
Case 1 in Table 3 and Table 4. For three different cases, Case 1.1, Case 1.2, and Case 1.3, the agent is at the Kitchen location, so much of the semantic meaning between the textual observations is similar, i.e. o_look or o_inv. In each case, the agent is meant to go west with n_score = 10, go west with n_score = 39, and go east with n_score = 45, respectively. (The full observations, e.g. the Kitchen description for Case 1.1 at step 4, are given in Table 3.) In Case 1.1, although the optimal choice of action is west, replacing the score from n_score = 10 to n_score = 45 makes the agent choose east, which is appropriate for Case 1.3. Another interesting observation is that replacing the game score decreases the Q value from 23.7460 to 5.0134 for west and from 18.4385 to 6.0319 for east in Case 1.1. It appears the agent believes it has already acquired the reward signals between n_score = 10 and n_score = 45, resulting in a reduction in Q value. We speculate that this is because the embedding of n_score carries some inductive bias, i.e. temporal, that lets the agent infer the stage of the game. This is consistently manifested in Case 1.3, but in Case 1.2 the agent is robust to the game score because the observation carries the painting, which is directly related to a reward signal, steering the agent to pursue that particular reward (put painting in case, worth +6 at the Living Room location).
Case 2 in Table 5 and Table 6 In Case 2, the agent is at Behind House in three other game instances, whose action and score pairs are open window with n_score = 0, west with n_score = 0, and north with n_score = 45. The phenomenon between Case 1.1 and Case 1.3 occurs in the same way between Case 2.2 and Case 2.3. However, unlike Case 1, the scores of Case 2.1 and Case 2.2 are the same. Nevertheless, the agent chooses the optimal action for Case 2.2 over Case 2.1 when n_score = 0 is injected into Case 2.3. This suggests that the agent captures the semantic correlation between "In one corner of the house there is a small window which is open" from the textual observation in Case 2.3 and the open window action. Because the small window is already open, the open window action is no longer required, so the agent tends to produce west, which is appropriate for Case 2.2.
Thus, from our qualitative analysis, we speculate that the agent captures the semantics of the textual observations and infers the game stage from the game score embedding to make optimal decisions.

D Stronger Supervised Signals for ZORK1
We also explored how stronger supervised signals can induce better regularization in ZORK1. Similar to the other sets of experiments, we selected a variety of λ_T-λ_O pairs. However, our results show that TAC starts under-fitting in ZORK1 when larger λ_T and λ_O are applied.

Here, ε_min, ε_max, and a are the hyper-parameters of the dynamic ε schedule, and n_tst is the average testing game score. During training, ε increases exponentially with higher n_score, while a controls the slope of the exponential function; higher a makes the slope steeper. Intuitively, as the agent exploits well-known states, ε is small, encouraging the agent to follow its own policy, and as the agent reaches under-explored states (i.e., similar to the test condition), ε increases to encourage more diverse exploration. ε is normalized and scaled. An example plot is shown in Figure 10.
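The dynamic ε schedule described above can be sketched as follows. The text only states that ε grows exponentially with the game score, controlled by a, and is normalized and scaled between ε_min and ε_max; the closed form below (a normalized exponential) and the score ceiling `n_max` are illustrative assumptions, not the paper's exact formula:

```python
import math

def dynamic_epsilon(n_score, n_max, eps_min, eps_max, a):
    """Illustrative normalized-exponential epsilon schedule (assumed form).

    eps grows from eps_min at n_score = 0 to eps_max at n_score = n_max;
    larger `a` makes the curve steeper. `n_max` is an assumed score
    ceiling used here only to normalize the score.
    """
    frac = min(max(n_score / n_max, 0.0), 1.0)   # normalize score to [0, 1]
    scale = (math.exp(a * frac) - 1.0) / (math.exp(a) - 1.0)  # scaled exponential in [0, 1]
    return eps_min + (eps_max - eps_min) * scale
```

With ε_min = 0.3 and ε_max = 0.7, ε stays near 0.3 early in an episode and rises toward 0.7 as the score approaches the ceiling, matching the intuition that under-explored (high-score) states warrant more stochastic behavior.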
We conducted a set of ablations with dynamic ε in DETECTIVE, PENTARI, REVERB, ZORK1 and ZORK3. We used ε_min ∈ {0.0, 0.3}, a ∈ {3, 9} and ε_max ∈ {0.7, 1.0}, for a total of 8 different hyper-parameter settings. Figure 8 shows fixed ε_min = 0.0 with varying a and ε_max, while Figure 9 shows fixed ε_min = 0.3. Other than in ZORK3, TAC with dynamic ε matches or underperforms TAC with fixed ε = 0.3. There are two interesting phenomena. (i) Too high an ε_max results in more unstable learning and lower performance. This becomes very obvious in PENTARI, REVERB and ZORK1, where, regardless of ε_min and a, if ε_max = 1.0, the learning curve is relatively low. In DETECTIVE in Figure 8, learning becomes much more unstable with ε_max = 1.0. This indicates that even in under-explored states, exploitation may still be required. (ii) Too low an ε_min results in more unstable learning and lower performance. Although PENTARI benefits from ε_min = 0.0, the learning curves in Figure 8 are generally lower and more unstable than those in Figure 9. This suggests that no matter how well the agent has learned the environment, it still needs to act stochastically to collect diverse experiences.

Figure 1 :
Figure 1: Text-based Actor-Critic (TAC). The blue circle is the input to the encoder, (n_score, o_game, o_look, o_inv), representing (game score, game feedback, room description, inventory), while the red circle is the output from the actor, a_N, representing the natural language action. Blue, red and green boxes indicate the encoder, actor and critic, respectively.
where L_T and L_O are the cross-entropy losses over the templates (T) and objects (O). The template and object are defined as a_T and a_O, while â is the action constructed from the previously sampled template and object. The labels y_{a_T} and y_{a_O} are positive only if the corresponding template or object is in the admissible template set (T_a) or admissible object set (O_a).4
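The supervised objective over admissible templates and objects can be sketched as cross-entropy against binary admissibility labels. The function names, the mean reduction, and the probability-vector interface below are illustrative assumptions rather than the paper's exact implementation:

```python
import math

def admissibility_ce(probs, admissible_indices):
    """Binary cross-entropy against admissibility labels (illustrative).

    probs: predicted probabilities over all templates (or objects).
    admissible_indices: indices whose label y is 1 (admissible);
    all other labels are 0. Returns the mean binary cross-entropy.
    """
    loss = 0.0
    for i, p in enumerate(probs):
        y = 1.0 if i in admissible_indices else 0.0
        p = min(max(p, 1e-8), 1.0 - 1e-8)  # clip for numerical stability
        loss += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return loss / len(probs)

def supervised_loss(t_probs, o_probs, admissible_t, admissible_o,
                    lambda_t=1.0, lambda_o=1.0):
    """Weighted sum lambda_T * L_T + lambda_O * L_O over templates and objects."""
    return (lambda_t * admissibility_ce(t_probs, admissible_t)
            + lambda_o * admissibility_ce(o_probs, admissible_o))
```

Raising λ_T and λ_O simply scales the supervised term relative to the RL objective, which matches the under-fitting behavior discussed for stronger supervised signals in ZORK1.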

Figure 2 :
Figure 2: The full learning curves of TAC on five games in the Jericho suite. Blue and red plots are the training and testing game scores, while the cyan and yellow star-marker lines signify CALM-DRRN and KG-A2C.

Figure 3 :
Figure 3: Ablation study on five popular games in the Jericho suite. Four different ablations are conducted: with SL, with ε = 0.0, with ε = 1.0, and with full admissible constraints during training (Admissible Action space). Similar to the previous figure, CALM-DRRN and KG-A2C are added for comparison.

Figure 4 :
Figure 4: The learning curves of TAC for stronger supervised signals, where 5-3 signifies λ_T = 5 and λ_O = 3. The left two plots are with ε = 0.3 and the right two are with ε = 0.

Figure 5 :
Figure 5: The details of the actor and critic of the text-based actor-critic. The state representation is the input to the actor-critic, while the red circle is the output from the actor, a_N, representing the natural language action. Red and green boxes indicate the actor and critic, respectively.

Figure 10 :
Figure 10: The exponential probability ε over the game score. The left plot is with ε_min = 0.0, ε_max = 1.0 and the right with ε_min = 0.3, ε_max = 0.7, between game scores of 0 and 6. Five different values of a are drawn per plot.
Partially Observable Markov Decision Process. TG environments can be formalized as Partially Observable Markov Decision Processes (POMDPs). A POMDP is defined as a 7-tuple, (S, A, P, O, P_o, R, γ), where S and A are the sets of states and actions, and P is the state transition probability that maps a state-action pair to the next state, Pr(s_{t+1}|s_t, a_t). O is the set of observations that depend on the current state via an emission probability, P_o ≡ Pr(o_t|s_t). R is an immediate reward signal held between the state and the next state, r(s_t, s_{t+1}), and γ is the discount factor. The action selection rule is referred to as the policy π(a|o), in which the optimal policy acquires the maximum rewards in the fewest moves.

TG Environment as POMDP. Three textual observations are acquired from the engine: game feedback o_game, room description o_look, and inventory description o_inv. The game feedback depends on the previous action, Pr(o_{game,t}|s_t, a_{t−1}), while the room and inventory descriptions do not, Pr(o_{look,t}|s_t) and Pr(o_{inv,t}|s_t). Inadmissible actions do not influence the world state, room description, or inventory description, but they do change the game feedback. Each action is sampled sequentially from a template-based action space. The template is sampled directly from the observation, π(a_T|o), while the object policy is produced sequentially, π(a_O|o, â), where â is the previously sampled template-object pair.
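The factorized sampling, π(a_T|o) followed by π(a_O|o, â), can be sketched as below. The `OBJ` placeholder convention and the dictionary-based distribution interface are illustrative assumptions; in the actual agent both distributions come from the policy network:

```python
import random

def sample_template_object_action(template_dist, object_dist_fn, rng):
    """Sample a template, then fill its object slots sequentially.

    template_dist: dict mapping a template string (with OBJ placeholders)
                   to its probability under pi(a_T | o).
    object_dist_fn: callable taking the partially built action and
                    returning a dict object -> probability, i.e.
                    pi(a_O | o, a_hat) conditioned on what was sampled so far.
    """
    templates, t_probs = zip(*template_dist.items())
    action = rng.choices(templates, weights=t_probs, k=1)[0]
    while "OBJ" in action:
        obj_dist = object_dist_fn(action)          # condition on partial action
        objects, o_probs = zip(*obj_dist.items())
        obj = rng.choices(objects, weights=o_probs, k=1)[0]
        action = action.replace("OBJ", obj, 1)     # fill one slot at a time
    return action

# Hypothetical example: one no-object template and one single-object template.
rng = random.Random(0)
act = sample_template_object_action(
    {"open OBJ": 0.8, "west": 0.2},
    lambda partial: {"window": 0.9, "sack": 0.1},
    rng,
)
```

Sampling sequentially in this way keeps the action space linear in the template and object vocabularies rather than enumerating every template-object combination.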

Table 1 :
Table 1: Game score comparison over 10 popular game environments in Jericho, with the best results highlighted in boldface. We only included algorithms that reported end performance. † HEX-RL and HEX-RL-IM did not report performance on ZORK3 and are not open-sourced, so the mean average does not account for ZORK3.

Table 1 shows that TAC achieves a ∼50% higher normalized mean score than both CALM-DRRN and KG-A2C. Per game, in SORCERER, SPIRIT, ZORK3 and ZTUU, TAC achieves at least ∼200% and at most ∼400% higher scores. In ACORNCOURT, DEEPHOME and DRAGON, both

Table 2 :
Table 2: Action space for a game observation (top panel) for CALM (LM), KG-A2C (KG), and the Admissible Action set. Red and blue colored actions are missed by either CALM or KG-A2C, brown actions are missed by both, and black actions are covered by both.
TAC uses separate gated recurrent units (GRUs) to encode the textual observations and the previous action, (o_game, o_look, o_inv, a_{t−1}), and transforms the game score, n_score, into a binary encoding. These are then concatenated and passed through the state network to form the state representation. The state network is GRU-based to account for historical information. The actor-critic network consists
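The binary game-score encoding mentioned above can be sketched as follows; the paper only states that n_score is transformed into a binary encoding, so the fixed bit width here is an illustrative assumption:

```python
def binary_encode_score(n_score, n_bits=10):
    """Encode a non-negative game score as a fixed-width binary vector.

    The width n_bits is an assumption for illustration; the encoding is
    concatenated with the GRU-encoded textual observations before the
    state network.
    """
    if n_score < 0:
        raise ValueError("game score assumed non-negative here")
    # Most-significant bit first, e.g. 10 -> [..., 1, 0, 1, 0].
    return [(n_score >> i) & 1 for i in reversed(range(n_bits))]
```

A fixed-width binary vector gives the network a compact, ordered representation of score magnitude, consistent with the observation in the qualitative analysis that the score embedding carries a temporal inductive bias.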

Table 3 :
Case 1; Game observation and the selected action snippets from ZORK1.