Language-based General Action Template for Reinforcement Learning Agents

Prior knowledge plays a critical role in decision-making, and humans preserve such knowledge in the form of natural language (NL). To emulate real-world decision-making, artificial agents should incorporate such generic knowledge into their decision-making framework through NL. However, since policy learning with NL-based action representation is intractable due to NL's combinatorial complexity, previous studies have limited agents' expressive power to only a specific environment, which sacrificed the generalization ability to other environments. This paper proposes a new environment-agnostic action framework, the language-based general action template (L-GAT). We design action templates on the basis of general semantic schemes (FrameNet, VerbNet, and WordNet), facilitating the agent in finding a plausible action in a given state by using prior knowledge while covering broader types of actions in a general manner. Our experiment using 18 text-based games showed that our proposed L-GAT agent, which uses the same actions across games, achieved a performance competitive with agents that rely on game-specific actions. We have published the code at https://github.com/kohilin/lgat.


Introduction
The incorporation of natural language processing (NLP) and reinforcement learning (RL) is an important research field for using knowledge, represented in the form of language, in the decision-making of artificial agents (Luketina et al., 2019). One critical topic is the capability to describe an agent's actions with natural language (NL) (Narasimhan et al., 2015; Yuan et al., 2019). An agent with such a capability can estimate the plausibility of actions on the basis of prior knowledge accessible through language (Narasimhan et al., 2018). Suppose that an agent receives a request, "give me some water"; a common sense idea like "water exists in the kitchen" will definitely help the agent determine the right direction to go in. If actions are represented in language, for example, GO TO KITCHEN, we can straightforwardly connect knowledge to actions by referring to language resources (Fulda et al., 2017). NL is useful for accessing knowledge to achieve plausible decision-making, and this language capability would be fundamental in developing intelligent agents. However, NL is complicated for current RL agents to acquire due to its high expressive power. The rich vocabulary and complex grammar of NL result in a huge action space that is intractable for existing RL algorithms. Although we can train an agent by restricting the expressive power to a specific environment (Narasimhan et al., 2015; He et al., 2016), such constraints sacrifice the inherent advantage of using NL as the action representation for domain-independent learning. Our objective is to make better trade-offs between expressive power and the learnability of NL-based RL.
Interactive fiction (IF) games serve as a practical testbed for NL-based RL, where an agent and environment communicate with each other through textual information (Côté et al., 2019). The agent needs to understand a textual state and generate an appropriate NL action command. In the experiments of previous studies, various constraints on the action space have been used for the convergence of learning, such as simple grammar and vocabulary (Narasimhan et al., 2015), effective ground-truth actions for a state (He et al., 2016), and game-specific templates. These constraints are problematic, especially when applying the same algorithm to other environments. There are several studies that address this by implementing heuristic rules (Hausknecht et al., 2019) or training an action generator with observation-action pairs of human play (Yao et al., 2020), but they still have a nontrivial bias toward IF games.
In terms of prior knowledge, word embeddings and language models have shown success at reducing the action space (Fulda et al., 2017; Yao et al., 2020). Although such continuous representations have achieved notable performances for broader NLP tasks, they are basically trained with word co-occurrences. We cannot flexibly manipulate the knowledge contained in these continuous representations because it is difficult to selectively encode the desired knowledge into them (Zhou et al., 2020). To make the most of prior knowledge expressed in NL, we should also take advantage of other linguistic resources that provide fine-grained information more explicitly, such as the hierarchical structure of words (Miller, 1998), the semantics of a sentence (Fillmore et al., 2003), and common sense (Speer et al., 2017).
Thus, we propose a new environment-agnostic action framework on the basis of general semantic schemes, the language-based general action template (L-GAT). Figure 1 shows the overall architecture compared with the environment-dependent action templates commonly used, and Figure 2 illustrates the flow of generating an action with L-GAT. The agent with L-GAT first determines "what to do" at an abstractive level with generally defined action templates and then specifies "how to do it" by generating a concrete action on the basis of connected prior knowledge.
Our contributions with L-GAT are threefold. First, we propose an environment-agnostic action framework based on general semantic schemes such as FrameNet (Fillmore et al., 2003), VerbNet (Schuler, 2006), and WordNet (Miller, 1998). Second, we develop a hierarchical action-generation algorithm in which the agent performs abstractive decision-making and then generates a concrete action command. Third, we introduce a method for dynamically reducing the action space with static knowledge from multiple external resources and contextual information from a state. We have published the code so that future studies can easily use L-GAT as an action generation module.

Related Work
The action space problem of NL-based actions has been addressed in previous work. LSTM-DQN (Narasimhan et al., 2015) enables an agent to learn a policy in a simple verb-object format in synthetic environments. DRRN (He et al., 2016) is a ranking-based method in which the agent selects an action from among admissible actions given by a game engine. Game-specific templates were proposed for Jericho. The agent selects one of the templates and fills in the gaps in the chosen template. TDQN (Hausknecht et al., 2020) and KGA2C (Ammanabrolu and Hausknecht, 2020) agents have used these templates and succeeded at learning a policy in various games. NAIL (Hausknecht et al., 2019) solves IF games with action generation heuristics.
Language resources have also been used for reducing the action space. Fulda et al. (2017) extracted the affordances of words with word embeddings and restricted objects that can be taken for a specific verb. A language model is leveraged to filter non-plausible word combinations (Kostka et al., 2017). Recently, Yao et al. (2020) proposed the contextual action language model (CALM), which trains a language model with human-play transcriptions and uses it as an action candidate generator. By combining it with a ranking-based method, CALM showed significant performance even without a ground-truth for admissible actions.

Problem Setting
In this section, we formally define our problem setting, RL agents with NL-based actions. An NL-based action $a$, such as GO TO KITCHEN and GIVE SOME WATER TO JOHN, consists of $N$ words: $a = \{w_1, w_2, \ldots, w_N\}$, $w_i \in V$, where each $w_i$ is a word and $V$ is the vocabulary. Given a state $s$, we denote an optimal NL-based action in the state by $a^*(s)$ and the words composing it by $w^*_{(i,s)}$. Following previous studies, each $w \in V$ is evaluated with an independent Q-function $Q_{(i)}(s, w)$, where $i$ is the position of a word, and $Q(s, a)$ is defined as $\sum_i Q_{(i)}(s, w_i)$. A policy $\pi$ is evaluated with a cumulative reward as: $\mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} R(s_t, a_t)\right]$, where $R$ is a reward function for a state-action pair, and $\gamma$ is a discount factor. Then, the Q-functions $Q^*$ and $Q^*_{(i)}$ corresponding to the optimal policy are learned by minimizing the following loss: $L = \left(R(s, a) + \gamma \hat{Q}^*(s', a') - Q^*(s, a)\right)^2$, where $\hat{Q}^*(s', a') = \sum_i \max_{w'} \hat{Q}^*_{(i)}(s', w')$ with the next state $s'$, the next word $w'$, and the target Q-function $\hat{Q}$. Then, we obtain each word of the optimal action as: $w^*_{(i,s)} = \arg\max_{w} Q^*_{(i)}(s, w)$. However, this optimization could be intractable depending on $N$ and the size of $V$ due to the exponential complexity $O(|V|^N)$.
Hence, we consider restricting the vocabulary space for each position $i$ of the action and each state $s$ to make this optimization problem tractable. Specifically, we want to find a subset $V_{(i,s)}$ of $V$ for each $(i, s)$ such that $|V_{(i,s)}|$ is much smaller than $|V|$. Also, $V_{(i,s)}$ must include the optimal word $w^*_{(i,s)}$ of the original optimization problem because replacing $V$ with $V_{(i,s)}$ should not change the optimal solution. Assuming that we have $V_{(i,s)} \subset V$, the computational complexity is reduced to $O(\prod_{i} |V_{(i,s)}|)$. To define $V_{(i,s)}$, we use knowledge $K$: $V_{(i,s)} = \phi(i, s, K)$, where $\phi$ is a set-valued function. Then, we need to determine the function $\phi$ that minimizes $|V_{(i,s)}|$ but keeps the $i$-th word of an optimal NL-based action in $V_{(i,s)}$: $\min_{\phi} |V_{(i,s)}|$ subject to $w^*_{(i,s)} \in V_{(i,s)}$. Note that, although the above formulation assumes that $V_{(i,s)}$ always contains the optimal word $w^*_{(i,s)}$, this cannot be guaranteed in practice because the optimal words are unknown a priori and are estimated through the learning process. Therefore, we need to approximate $\phi$ so that the estimated $V_{(i,s)}$ is as likely as possible to contain $w^*_{(i,s)}$. We introduce our implementation of the approximation in Section 4.4.
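To make the effect of this restriction concrete, the following minimal sketch compares the size of the unrestricted search space, |V|^N, with the number of word sequences that remain when each position may use only a restricted vocabulary. The vocabulary size and the per-position sizes are hypothetical numbers chosen only for illustration; they are not taken from the experiments.

```python
from math import prod
from typing import Sequence


def full_space(vocab_size: int, n_words: int) -> int:
    """Number of word sequences of length n_words over the whole vocabulary."""
    return vocab_size ** n_words


def restricted_space(sizes: Sequence[int]) -> int:
    """Number of sequences when position i may only use sizes[i] words."""
    return prod(sizes)


# Hypothetical setting: a 1,000-word vocabulary and four-word actions.
VOCAB_SIZE, N_WORDS = 1000, 4
# Assumed sizes of the restricted vocabulary at each of the four positions.
RESTRICTED_SIZES = [25, 120, 8, 300]

print(full_space(VOCAB_SIZE, N_WORDS))     # 1000000000000
print(restricted_space(RESTRICTED_SIZES))  # 7200000
```

Even with fairly loose per-position restrictions, the number of candidate sequences in this toy setting drops by more than five orders of magnitude.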

Method
We now introduce our proposed method, the language-based general action template (L-GAT). L-GAT is a framework of NL-based actions for environment-agnostic RL agents. In L-GAT, we define actions on the basis of general semantic schemes, such as FrameNet (Fillmore et al., 2003), VerbNet (Schuler, 2006), and WordNet (Miller, 1998), which enables an agent to use connected knowledge to reduce the action space.
In this section, we first provide an overview of L-GAT and then give details on its action command, definition, and generation algorithm.

Overview
L-GAT adopts hierarchical modeling for action generation to handle the vast space of NL-based actions. Specifically, an agent with L-GAT first determines "what to do" at an abstractive level (e.g., decide to give) and then determines "how to do it" at a concrete level (e.g., decide to give tomato to mom), as shown in Figure 2. This two-step strategy is intuitive and natural as a general procedure for decision-making (Lazaridou et al., 2020). From the viewpoint of RL, we can reduce the action space because the number of abstractive actions is much smaller than the number of words in the whole vocabulary. Also, the chosen abstractive action further restricts the words available for generating a complete action (e.g., we cannot eat a table).
To use such hierarchical modeling, L-GAT defines the Abstractive Action Template with a hierarchical structure consisting of three components: Frame, Role, and Lexicon. Figure 3 illustrates the hierarchical structure of an abstractive action template, GO. The frame defines the semantics of GO (e.g., move oneself to somewhere), the role represents a conceptual argument required to perform the action (e.g., Destination), and the lexicon represents concrete words that can be used for one of the roles in the frame (e.g., west, down, and kitchen). Using the generation algorithm, the agent selects an appropriate word from the lexicon for each role of the frame in the chosen abstractive action template.
FrameNet and VerbNet inspired the hierarchical structure of L-GAT. Both resources provide semantic schemes that describe our daily behaviors conceptually. For example, similar to the GO template, FrameNet has a relevant scheme, "self-motion", with required arguments such as Goal and Source. Also, we connect the lexicon of L-GAT to WordNet to represent the hierarchical relationships of words. The word hierarchy enables agents to generalize candidate words (e.g., hyponyms of "direction," such as "west," can be used for the GO template) and interact with diverse environments. Thus, by designing actions on the basis of these general semantic schemes, we can naturally make L-GAT environment-agnostic and familiar with general prior knowledge.
Note that the templates of L-GAT are easily configurable depending on the environment. Such customizability is practically essential to applying the scheme to diverse domains, which is costly or difficult for an environment-dependent action set or language model-based generator.

Action Command
L-GAT generates an action command in a fixed format as: $a = v + n_1 + p + n_2$, with at most four slots for a verb ($v$), two nouns ($n_1$, $n_2$), and a preposition ($p$). An action command is generated by the action templates explained in the next section, and $n_1$, $n_2$, and $p$ are not necessarily used depending on the template. To prevent the exponential growth of the action space, we decided to impose these hard constraints on the generable format. However, we consider this format to be able to cover most of the primary actions demanded by agents, such as go kitchen and put cup on table.
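The fixed four-slot format can be pictured as a small record whose unused slots are simply left empty. The sketch below is only an illustration of the format; the class name `ActionCommand` and its fields are hypothetical and are not taken from the published code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionCommand:
    """Hypothetical container for the fixed format: verb, noun, preposition, noun."""
    verb: str
    noun1: Optional[str] = None
    prep: Optional[str] = None
    noun2: Optional[str] = None

    def to_string(self) -> str:
        # Unused slots are skipped when the command string is built.
        parts = [self.verb, self.noun1, self.prep, self.noun2]
        return " ".join(p for p in parts if p is not None)


print(ActionCommand("go", "kitchen").to_string())             # go kitchen
print(ActionCommand("put", "cup", "on", "table").to_string())  # put cup on table
```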

Action Definition
This section introduces the abstractive action template and its components. In this paper, we manually defined 41 abstractive action templates for L-GAT (see Appendix A). For the construction of each template, we chose relevant semantic schemes from FrameNet and VerbNet, and then aggregated them in a compatible form with the actions of RL agents. Table 1 shows three templates as references, and we will refer to them throughout this section.
We define an abstractive action template as a tuple $T = (f, r_v, r_{n_1}, r_p, r_{n_2})$, where $f$ is the frame, and $r_{(v|n_1|p|n_2)}$ are the roles for each slot. The frame defines the template's semantics. The roles represent required arguments for performing an action. Each word in an action command is an instance of its corresponding role and is selected from the lexicon explained later.
Frame. Each frame has connections to semantically relevant entities in FrameNet (called Semantic Frames), and ones in VerbNet optionally (called Verb Classes). For example, the GIVE template is related to the semantic frame Giving and the verb class give-13.1. We use FrameNet as the basis because the granularity of the descriptions is more suitable for our purpose than VerbNet.
Role. For the definition of our roles, we borrow VerbNet's Thematic Roles, which refer to the semantic relationship between a predicate and its argument. VerbNet defines 30 thematic roles in total, and L-GAT uses 15 of them by taking into account their frequency and meaning. FrameNet also has a similar concept called Frame Elements; however, it is too fine-grained for RL agents' actions. Therefore, we use the thematic roles of VerbNet as the basis and annotate the related frame elements as additional information. We also define three special roles: Predicate, Preposition, and Null. Predicate is used only for the verb slot and refers to verb nodes in WordNet. We selected these nodes on the basis of the representative verbs given by the connected frame entities of FrameNet and VerbNet (e.g., we chose "go.v.02" and "enter.v.01" for the Predicate of the GO template on the basis of representative verbs given by the semantic frame "self-motion"). Preposition holds the prepositions that often accompany verbs in a frame, and we manually defined these prepositions. Null means that the slot is not used for action generation. For example, the GO template uses only $v$ and $n_1$.
Lexicon. The simplest way of defining the lexicon is to list all possible words manually or statistically; however, such a procedure would be non-scalable or hard to configure. Therefore, we decided to annotate nodes in WordNet as the representation. We selected general nodes such as "location.n.01" for the Destination of the GO template and "food.n.01" for the Patient of the EAT template. The hierarchical word relations in WordNet enable L-GAT to structurally determine candidate words, for example by choosing the hyponyms of annotated nodes.
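The relationship among frame, role, and lexicon can be sketched as a small data structure. In the sketch below, `Role`, `Template`, `lexicon`, and the hand-made `TOY_HYPERNYM` map are all hypothetical names introduced for illustration; the toy map stands in for WordNet and does not reproduce its actual hierarchy, and only the node names "location.n.01" and "food.n.01" come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

# Toy hypernym map (child -> parent). A hand-made stand-in for WordNet,
# not real WordNet data.
TOY_HYPERNYM: Dict[str, str] = {
    "kitchen": "room",
    "room": "location.n.01",
    "west": "direction",
    "direction": "location.n.01",
    "tomato": "vegetable",
    "vegetable": "food.n.01",
    "water": "food.n.01",
    "table": "furniture",
}


def is_hyponym_of(word: str, node: str) -> bool:
    """Walk up the toy hierarchy from word and report whether node is reached."""
    current = word
    while current in TOY_HYPERNYM:
        current = TOY_HYPERNYM[current]
        if current == node:
            return True
    return False


@dataclass(frozen=True)
class Role:
    name: str                    # e.g. "Destination", "Patient", "Null"
    nodes: Tuple[str, ...] = ()  # annotated lexicon nodes for this role


@dataclass
class Template:
    frame: str
    roles: Dict[str, Role]       # slot name ("v", "n1", "p", "n2") -> role


def lexicon(template: Template, slot: str, vocab: Iterable[str]) -> Set[str]:
    """Words in vocab that are hyponyms of a node annotated for the slot's role."""
    role = template.roles[slot]
    return {w for w in vocab if any(is_hyponym_of(w, n) for n in role.nodes)}


GO = Template("GO", {"n1": Role("Destination", ("location.n.01",)),
                     "p": Role("Null"), "n2": Role("Null")})
EAT = Template("EAT", {"n1": Role("Patient", ("food.n.01",)),
                       "p": Role("Null"), "n2": Role("Null")})
VOCAB = ["kitchen", "west", "tomato", "water", "table"]

print(sorted(lexicon(GO, "n1", VOCAB)))   # ['kitchen', 'west']
print(sorted(lexicon(EAT, "n1", VOCAB)))  # ['tomato', 'water']
```

Annotating a few general nodes and deriving candidates by hyponymy keeps the template definitions short while still covering words that were never listed explicitly.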

Action Generation
We propose three techniques for action generation with L-GAT: Hierarchical Prediction, Word Masking, and Template Masking. An agent with L-GAT generates an action as follows. The agent determines the abstractive action template and then selects a word for each slot from among candidates (Hierarchical Prediction). The candidates for each slot are filtered by using information from the state and the chosen template (Word Masking). By precomputing Word Masking for all templates in advance of the template selection, L-GAT can provide possible templates for a state by excluding templates that cannot generate any action command (Template Masking). The overall procedure is described in Algorithm 1 (Action Generation with L-GAT).

Hierarchical Prediction
L-GAT first selects a template $T$ with $Q(s, T)$ and then generates words for each slot as: $w_{(\cdot)} = \arg\max_{\hat{w}} Q_{(\cdot)}(s, \hat{w})$, $\hat{w} \in V_{(\cdot,T,s)}$, where $V_{(\cdot,T,s)}$ is the vocabulary that is restricted for the target slot of the template $T$ in the state $s$. L-GAT computes $V_{(\cdot,T,s)}$ as: $V_{(\cdot,T,s)} = \{w \mid m(\cdot, s, T, w) = 1, w \in V\}$, where $m$ is a masker function that returns 1 for a generable word in the target slot by considering $s$ and $T$. The masker function is our approximation of $\phi$ explained in Section 3, and we introduce the details in the next section.
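The two-step selection can be sketched with plain dictionaries standing in for the learned Q-functions. Everything in the block below (function names, the toy Q-values, and the allowed word sets) is a hypothetical illustration under those assumptions, not the published implementation.

```python
from typing import Dict, Optional, Set, Tuple

SLOTS = ("v", "n1", "p", "n2")


def masked_argmax(q_values: Dict[str, float], allowed: Set[str]) -> Optional[str]:
    """Return the allowed word with the highest Q-value, or None if none is allowed."""
    candidates = [w for w in q_values if w in allowed]
    if not candidates:
        return None
    return max(candidates, key=lambda w: q_values[w])


def generate_action(
    template_q: Dict[str, float],
    slot_q: Dict[str, Dict[str, float]],
    allowed: Dict[str, Dict[str, Set[str]]],
) -> Tuple[str, str]:
    """Pick the best template first, then the best allowed word for each slot."""
    template = max(template_q, key=lambda t: template_q[t])
    words = []
    for slot in SLOTS:
        slot_allowed = allowed[template].get(slot, set())
        word = masked_argmax(slot_q.get(slot, {}), slot_allowed)
        if word is not None:
            words.append(word)
    return template, " ".join(words)


# Toy values for one state (assumed, for illustration only).
TEMPLATE_Q = {"GO": 1.2, "EAT": 0.4}
SLOT_Q = {
    "v": {"go": 0.9, "enter": 0.3, "eat": 1.5},
    "n1": {"kitchen": 0.2, "west": 0.7, "tomato": 2.0},
}
ALLOWED = {
    "GO": {"v": {"go", "enter"}, "n1": {"kitchen", "west"}},
    "EAT": {"v": {"eat"}, "n1": {"tomato"}},
}

RESULT = generate_action(TEMPLATE_Q, SLOT_Q, ALLOWED)
print(RESULT)  # ('GO', 'go west')
```

Note that "eat" and "tomato" have the highest raw Q-values in their slots, yet they are never produced here because the chosen GO template masks them out.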

Word Masking
We included five sub-masker functions: Role masker, Language Model (LM) masker, Part-of-Speech (PoS) masker, Stopword masker, and Observation masker. Each of them returns 1 for generable words. Then, we define $m(\cdot, s, T, w) = 1$ only when all the sub-maskers return 1. Next, we give the definitions of the five sub-maskers.
Role masker. The Role masker filters out words that are not semantically compatible with a given role by referring to the annotated WordNet nodes as mentioned in §4.3. For the noun slots, it enables hyponym words of the reference nodes to be produced. For example, for the $n_1$ slot of the EAT template, it returns 1 for hyponym words of the food.n.01 node such as water and tomato (see Table 1). For the $v$ slot (i.e., Predicate), it returns 1 for lemmas of the annotated nodes themselves such as go.v.01 = go and enter.v.01 = enter. All words have 1 in the $p$ slot, and no words have 1 in any Null slot.
LM masker. The LM masker filters contextually irrelevant candidates with an LM. The LM takes an observation suffixed with a verb of the template as the input (an example is given in Appendix B), and it predicts the following word. On the basis of the estimated probability of next words, the LM masker returns 1 for the top-k words. At the v and p slots, all words have 1. We used a pre-trained GPT-2 (Radford et al., 2019) and set k = 50.
PoS, Stopword, and Observation maskers. The PoS masker filters out words whose PoS does not match the expected one. Specifically, it returns 1 for verbs in the $v$ slot, for nouns in the $n_1$ and $n_2$ slots, and for prepositions in the $p$ slot. The word-PoS mapping follows WordNet in terms of verbs and nouns, and we prepared a fixed list of prepositions. The Stopword masker prohibits the generation of stopwords such as determiners (e.g., a, the) and pronouns (e.g., he, them). We prepared a fixed list, and the masker returns 1 for words that are not on the list. The Observation masker filters out words that do not appear in the observation. It returns 1 for words that appeared in the last observation.
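The conjunction of sub-maskers can be sketched as a list of predicates that must all accept a word. The block below implements toy versions of three of the five sub-maskers; the names, the toy role table, the stopword list, and the tokenization are assumptions made for illustration. In particular, applying the Observation masker only to noun slots is an assumption of this sketch, not something stated above.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A sub-masker takes (slot, observation, template, word) and returns True for
# a generable word.
Masker = Callable[[str, str, str, str], bool]

STOPWORDS: Set[str] = {"a", "the", "he", "them"}
# Toy role table standing in for WordNet hyponym lookups.
ROLE_WORDS: Dict[Tuple[str, str], Set[str]] = {("EAT", "n1"): {"tomato", "water"}}
NOUN_SLOTS: Set[str] = {"n1", "n2"}


def role_masker(slot: str, observation: str, template: str, word: str) -> bool:
    return word in ROLE_WORDS.get((template, slot), set())


def stopword_masker(slot: str, observation: str, template: str, word: str) -> bool:
    return word not in STOPWORDS


def observation_masker(slot: str, observation: str, template: str, word: str) -> bool:
    # Assumption of this sketch: only noun slots are checked against the observation.
    if slot not in NOUN_SLOTS:
        return True
    return word in re.findall(r"[a-z0-9]+", observation.lower())


def word_mask(maskers: List[Masker], slot: str, observation: str,
              template: str, word: str) -> bool:
    """A word is generable only when every sub-masker accepts it."""
    return all(m(slot, observation, template, word) for m in maskers)


def restricted_vocab(maskers: List[Masker], slot: str, observation: str,
                     template: str, vocab: Iterable[str]) -> Set[str]:
    return {w for w in vocab if word_mask(maskers, slot, observation, template, w)}


OBSERVATION = "You see a tomato and a table in the kitchen."
VOCAB = ["a", "the", "tomato", "water", "table", "kitchen", "sword"]
MASKERS: List[Masker] = [role_masker, stopword_masker, observation_masker]

EAT_N1 = restricted_vocab(MASKERS, "n1", OBSERVATION, "EAT", VOCAB)
print(sorted(EAT_N1))  # ['tomato']
```

Here "water" passes the role check but is rejected because it does not occur in the observation, while "table" occurs in the observation but is rejected by the role check.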

Template Masking
L-GAT also filters out templates that are unavailable in a state by using the $V_{(\cdot,T,s)}$ of each template. $V_{(\cdot,T,s)}$ is empty when, for every word in $V$, at least one of the sub-maskers denies its use. An empty $V_{(\cdot,T,s)}$ means that no appropriate objects exist for performing the action (e.g., we cannot eat anything if no edible objects exist). Thus, L-GAT selects a template as: $T = \arg\max_{T \in \mathcal{T}} Q(s, T)$, where $\mathcal{T}$ is the set of templates in which $V_{(\cdot,T,s)}$ is not empty for all slots except for Null slots.
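Template masking can be sketched as a filter applied before the template argmax: a template survives only if every slot it actually uses has at least one generable word. The function names and toy inputs below are hypothetical and serve only to illustrate that rule.

```python
from typing import Dict, Optional, Set

SLOTS = ("v", "n1", "p", "n2")


def available_templates(
    restricted: Dict[str, Dict[str, Set[str]]],
    null_slots: Dict[str, Set[str]],
) -> Set[str]:
    """Templates whose non-Null slots all have a non-empty restricted vocabulary."""
    available = set()
    for template, slots in restricted.items():
        used = [s for s in SLOTS if s not in null_slots.get(template, set())]
        if all(slots.get(s) for s in used):
            available.add(template)
    return available


def select_template(template_q: Dict[str, float], available: Set[str]) -> Optional[str]:
    """Argmax of the template Q-values over the available templates only."""
    candidates = [t for t in template_q if t in available]
    if not candidates:
        return None
    return max(candidates, key=lambda t: template_q[t])


# Toy state: nothing edible is present, so EAT has no candidate for its n1 slot.
RESTRICTED = {
    "GO": {"v": {"go"}, "n1": {"west"}},
    "EAT": {"v": {"eat"}, "n1": set()},
}
NULL_SLOTS = {"GO": {"p", "n2"}, "EAT": {"p", "n2"}}
TEMPLATE_Q = {"GO": 0.1, "EAT": 0.9}

AVAILABLE = available_templates(RESTRICTED, NULL_SLOTS)
print(AVAILABLE)                               # {'GO'}
print(select_template(TEMPLATE_Q, AVAILABLE))  # GO
```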

Experiment
In our experiment, we measured performance in IF games in Jericho. We compared our L-GAT agent with agents that rely on environment-dependent actions. We also performed detailed analyses of L-GAT, namely, action coverage, action space reduction, and a masker ablation study. Furthermore, to evaluate the generalization ability of L-GAT, we also conducted an extended experiment in which a single agent solved all games.

Settings
Game environment. We selected 18 games in Jericho on the basis of the performance in previous studies. We did not select any game that was too hard for most of the existing agents to learn.
In the experiments with TDQN, KGA2C, and CALM, steps were counted selectively by excluding no-effect actions that do not change the world state, for faster and more stable learning. This limitation is not appropriate for testing general action sets such as L-GAT because the game engine explicitly restricts the action space. To focus on the comparison of environment-dependent and environment-agnostic action templates, we reimplemented the TDQN algorithm in our code base (denoted as TDQN+), in which all steps were counted regardless of action effectivity. Also, we implemented TDQN+ without the admissible action limitation, so that the only difference between TDQN+ and L-GAT with respect to the above limitations is their template types (i.e., game-specific or general templates). Thus, we mainly compared NAIL, TDQN+, and L-GAT. Although a comparison with the original scores of TDQN, KGA2C, and CALM would not be fair, we also add them as references for completeness. Note that the vocabulary limitation is practically needed because any action containing words unknown to a game engine cannot be accepted even if the action command is semantically correct. CALM overcame this vocabulary issue by learning the word distribution of action commands used in IF games from human-play transcriptions.
Model and training details. Following Hausknecht et al. (2020) and Yao et al. (2020), we used a bidirectional GRU as our observation encoder, and the observation string was augmented by combining the observations returned by the "look" and "inventory" commands. All Q-functions were implemented with a multi-layered perceptron with the same hyperparameters. In an episode, we limited the number of steps to 100 at most and ran agents with ten environments in parallel. In addition to the rewards given by the games, we gave an exploration bonus of 1 when an agent found an unseen state in the episode (Yuan et al., 2019). The hyperparameter details are given in Appendix C. We trained three agents with different random seeds and used their average score for the evaluation.

Results
We report the results of the game performance, action coverage by L-GAT, action space reduction by L-GAT, and an ablation study on Word Masking. Finally, we introduce the results of a single agent that solved multiple games.
Game performance. Following the previous studies, we calculated the average score of the last 100 episodes, and the results are shown in Table 3. The average normalized score (raw score divided by the maximum score) was 8.0% for NAIL, 9.9% for TDQN+, and 8.9% for L-GAT. Even though the action templates of L-GAT were designed in a general manner, our agent achieved reasonable performance across games and outperformed NAIL and TDQN+ in six games.
NAIL and L-GAT performed poorly in enchanter and snacktime compared with TDQN+. In these games, a number of critical actions for advancing the story were missing because their generable actions are defined outside of a specific game. For example, artificial words like spells (e.g., frotz, gnusto) appear in enchanter, and they are intractable with L-GAT, which uses only common language. As an additional investigation, we added a "SPELL" action to L-GAT for producing these spell words and re-trained an agent with the enchanter game; the score of L-GAT then increased to 20.0, which was close to the score of TDQN+ (22.9).
The scores of KGA2C (14.1%) and CALM (15.5%) were significantly higher. This suggests that their techniques, such as the dynamic state graph encoding and the domain adaptation of LM's outputs, are promising, and we will investigate their importance for our framework as future work.
Walkthrough coverage. Jericho provides an action trajectory called a walkthrough that solves a game. We assessed how much L-GAT covered walkthrough actions. The detailed procedure of this assessment is explained in Appendix C. The right-most column in Table 3 shows the cover ratio of the walkthrough actions for L-GAT. The average coverage across games was 82%, and around 90% or more of the actions were covered in half of the games. A lower coverage ratio (< 70%) was observed in balances (53%), spellbrkr (57%), ztuu (62%), and library (69%). In their walkthrough actions, we frequently found artificial words (e.g., rezrov), named entities such as person's name (e.g., give card to Alan), and modifiers for objects (e.g., wear fur suit, take ticket 2306). Although the first case might be out-of-scope of L-GAT, the second and third are critical because they must appear in an interaction with real-world environments.
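The coverage check can be sketched as follows: for each walkthrough step, an action counts as covered if at least one generable command leads to the same next state as the gold action, and steps whose gold action does not change the state are skipped. The sketch uses a toy deterministic transition table instead of a real game; state equality stands in for comparing state hashes, and all names and data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Transition = Callable[[str, str], str]  # (state, action) -> next state

# Toy deterministic world. Unknown (state, action) pairs leave the state unchanged.
TOY_WORLD: Dict[Tuple[str, str], str] = {
    ("hall", "west"): "kitchen",
    ("hall", "go west"): "kitchen",
    ("kitchen", "take ticket 2306"): "kitchen+ticket",
}


def toy_transition(state: str, action: str) -> str:
    return TOY_WORLD.get((state, action), state)


def walkthrough_coverage(
    transition: Transition,
    start: str,
    walkthrough: List[str],
    generate: Callable[[str], List[str]],
) -> float:
    """Fraction of effective walkthrough steps reproducible by a generable command."""
    state, covered, total = start, 0, 0
    for gold in walkthrough:
        target = transition(state, gold)
        if target != state:  # skip no-effect walkthrough actions
            total += 1
            if any(transition(state, a) == target for a in generate(state)):
                covered += 1
        state = target       # always follow the gold trajectory
    return covered / total if total else 0.0


def toy_generator(state: str) -> List[str]:
    # Stand-in for all commands generable in a state; it cannot produce the
    # modified noun "ticket 2306".
    return ["go west", "take ticket"]


COVERAGE = walkthrough_coverage(
    toy_transition, "hall", ["west", "look", "take ticket 2306"], toy_generator
)
print(COVERAGE)  # 0.5
```

In this toy run, "go west" reproduces the transition of the gold action "west", "look" is excluded as a no-effect step, and "take ticket 2306" is not covered, mirroring the modifier problem discussed above.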
Action space reduction. We evaluated how much the action space was reduced in L-GAT by applying Word Masking (Section 4.4.2) and Template Masking (Section 4.4.3). For each slot ($v$, $n_1$, $p$, and $n_2$) in each available template, we counted the number of words that were accepted by these techniques. Compared with the size of the whole vocabulary, the number of words was significantly reduced by 94% for $v$ (24.8 words on average after reduction), 82% for $n_1$ (117.9 words), 99% for $p$ (7.7 words), and 59% for $n_2$ (292.2 words).
Masker ablation. Figure 5 shows the learning curves of the L-GAT agents without one of the sub-maskers for Word Masking in zork1. A significant effect was observed with the Role masker. Specifically, the agent without the Role masker took more episodes to find the scored trajectory, and its overall performance degraded. The PoS and Stopword maskers were not critical, which might be because their restrictions are already covered by the Role masker. The LM and Observation maskers had non-trivial effects. Since the Role masker allows a relatively broad range of words in noun slots (i.e., hyponyms of a general word), contextual restrictions given by the LM and Observation maskers helped the agent identify available objects.
Game-independent agent. To investigate the generalization ability of L-GAT, we conducted a multi-task experiment where a single agent solved all the games. We prepared two types of agents: 1) a multi-task agent, which learns with all 18 games and solves each game, and 2) a zero-shot agent, which learns with 17 games and solves the remaining unseen game. Note that L-GAT is naturally applicable to this experimental setting. In terms of the implementation of TDQN+, we accumulated all the unique templates used in training games as available templates. Table 4 shows the game scores. Both TDQN+ and L-GAT largely decreased in score compared with Table 3, in which we trained and tested a specific agent for each game. Also, as expected, the zero-shot situation was more difficult. Whereas L-GAT outperformed TDQN+, the scores were far from achieving the goal. It was still challenging to obtain an environment-agnostic agent even though the action set itself is defined in a general manner.

Limitations and Future Work
Although L-GAT showed competitive performance in our experiment, there is still room for improvement in its language capability. We here discuss L-GAT's limitations and future improvements.
Experiments in environments other than text-based games. One of the advantages of L-GAT is its action generality. Jericho is a reasonable testbed for L-GAT because of the diverse games and strong NL interpreter. However, the performance in other environments such as dialogue systems (Dinan et al., 2019) and 3D video games (Gordon et al., 2018) will also be critical metrics because objectives for IF-game agents are not necessarily compatible with ones for real-world agents. Testing with real-world-like environments will lead to insights for improving L-GAT.
Enhancement of expressive power. In our experiment, L-GAT had a high action coverage in the IF games. However, the format of generable actions is fixed, and we observed several critical limitations in generating frequently used expressions. For example, L-GAT cannot cope with multi-word expressions (e.g., turn on, New York) and modified nouns (e.g., red cup, dog in the room). Note that we can make them generable without adding extra slots, for example by inserting multi-word expressions in the vocabulary and integrating a reference resolution module. A technique that increases the expressive power but keeps or decreases the action space would be a desirable enhancement.
Zero-shot learning in L-GAT. The general-purpose property of L-GAT can be seen as a zero-shot learning problem. Recently, Jain et al. (2020) proposed a new experiment to assess the adaptability of an RL agent to unseen states, such as using a new tool with the knowledge of known tools. L-GAT can potentially work in such a zero-shot situation. For example, let us assume that an L-GAT agent has learned a policy for attacking a monster with a hammer, and it then encounters a monster when only a sword is available. Even if the agent has never used a sword, it may be able to use one because L-GAT has knowledge on attacks with "weapon" (i.e., the Instrument role of the ATTACK template has a weapon node in WordNet, and "sword" is a hyponym of "weapon"; see Table 1 in Appendix A). Investigation into the adaptability to unseen states would be a promising analysis for L-GAT.

Connection to wide NLP resources and tools. Although we developed L-GAT on top of general semantic schemes such as FrameNet, VerbNet, and WordNet, the required information for efficient decision-making cannot be covered with only those resources. Intelligent agents must comprehend common sense, causality, and world knowledge (Luketina et al., 2019). Research on NLP has created various resources, such as those for retrieving information from Wikipedia (Chen et al., 2017), inferring the next action from a current state (Zellers et al., 2018), and using a commonsense database (Speer et al., 2017). Integration of such resources into L-GAT is critical future work, and the three general semantic schemes would be helpful for bridging different knowledge bases.

Conclusion
We proposed L-GAT as an environment-agnostic language-based action framework for RL agents. We designed L-GAT so that an agent can have a higher language capability to generate its own actions while keeping the action space tractable by using prior knowledge. Our experiment with multiple IF games showed that the L-GAT agent performed competitively against agents with game-specific actions. We discussed the current limitations of L-GAT and its future improvements.

Ethical Consideration
Our research is on defining a new action framework for RL agents. Our proposed method, L-GAT, is for generating actions of RL agents, which is unlikely to produce ethically problematic sentences such as those in hate speech. Also, we can control the vocabulary so that an agent with L-GAT does not produce such problematic sentences. Therefore, we consider the ethical risks of L-GAT to be low and controllable.

A Definitions of Action Templates
We manually defined 41 abstractive action templates in total, and Table 6 shows extended examples. We give definitions of all of the templates at https://github.com/kohilin/lgat.

B Masker Function Details
Input for LM masker. The input of the LM masker is an observation string suffixed with one of the verbs in the template. For example, in the case of GIVE with the observation "You have a cup of water. A boy stands in front of you.", the input for the $n_1$ slot is "... front of you. You gave the," and that for the $n_2$ slot is "... front of you. You gave it to the."
Mask for primitive actions. L-GAT allows an agent to produce primitive actions, related to movement and belongings, regardless of the outputs of the masker function. Specifically, we forcibly assign 1 to direction words (up, down, north, south, east, west, northwest, northeast, southwest, southeast) in the $n_1$ slot of the GO template and to all words in the $n_1$ slots of the GET and DROP templates.
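The LM masker input described above, together with the top-k selection from Section 4.4.2, can be sketched without loading an actual language model. In the block below, `build_lm_input` and `top_k_words` are hypothetical helper names; the suffix patterns simply mirror the GIVE example, and the next-word probabilities are made-up numbers standing in for the output of a pre-trained LM.

```python
from typing import Dict, Set


def build_lm_input(observation: str, past_verb: str, slot: str) -> str:
    """Suffix the observation with the verb so that the LM predicts the slot's word."""
    if slot == "n1":
        suffix = f"You {past_verb} the"
    elif slot == "n2":
        # Pattern taken from the GIVE example; other templates may need
        # a different preposition (assumption).
        suffix = f"You {past_verb} it to the"
    else:
        raise ValueError("only noun slots are built in this sketch")
    return f"{observation} {suffix}"


def top_k_words(next_word_probs: Dict[str, float], k: int) -> Set[str]:
    """Words accepted by the masker: the k most probable next words."""
    ranked = sorted(next_word_probs, key=lambda w: next_word_probs[w], reverse=True)
    return set(ranked[:k])


OBSERVATION = "You have a cup of water. A boy stands in front of you."
PROMPT_N1 = build_lm_input(OBSERVATION, "gave", "n1")
PROMPT_N2 = build_lm_input(OBSERVATION, "gave", "n2")
print(PROMPT_N1)
print(PROMPT_N2)

# Made-up next-word probabilities standing in for an LM's prediction.
TOY_PROBS = {"water": 0.5, "boy": 0.3, "cup": 0.15, "sky": 0.05}
print(sorted(top_k_words(TOY_PROBS, k=2)))  # ['boy', 'water']
```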

C Experimental Details
Hyperparameters. Table 5 shows our experimental hyperparameters and the values searched for in the hyperparameter search. We tuned the $\alpha$ decay and the initial $\epsilon$, minimum $\epsilon$, and $\epsilon$ decay for each game and each method. The other parameters were determined by testing with detective, temple, and zork1, and we applied the same settings to all of the games.
Machine specifications. We ran our experiment on Red Hat Enterprise Linux with an Intel(R) Xeon(R) CPU E5-2690 v4 at 2.60 GHz with 500GB of RAM and a single GPU, an NVIDIA TESLA K80.
Training time. Training with one game took approximately 10 to 15 hours for the L-GAT agents and 4 to 6 hours for the TDQN+ agents, depending on the game.
Model size. The number of trainable parameters for the L-GAT agent was about 10 million, which changed slightly depending on the vocabulary size of the games.
Walkthrough experiment. In the experiment, we investigated how many walkthrough actions were covered by L-GAT. The same transition can be possible with different action strings. Therefore, for each state, we tried all generable strings of L-GAT and checked whether at least one of them could produce the same transition as the gold walkthrough action. To judge whether two transitions were the same, we used the game state hash provided by Jericho. Several walkthrough actions did not change the world state hash. We excluded such no-effect walkthrough actions from the evaluation since completely different actions were classified as replaceable for them.
Figure 6 shows learning curves of the L-GAT and TDQN+ agents in each game. Table 7 shows the best play trajectory of our L-GAT agent in zork1.
Table 6: Extended examples of abstractive action templates in L-GAT.