Efficient Text-based Reinforcement Learning by Jointly Leveraging State and Commonsense Graph Representations

Text-based games (TBGs) have emerged as useful benchmarks for evaluating progress at the intersection of grounded language understanding and reinforcement learning (RL). Recent work has proposed the use of external knowledge to improve the efficiency of RL agents for TBGs. In this paper, we posit that to act efficiently in TBGs, an agent must be able to track the state of the game while retrieving and using relevant commonsense knowledge. Thus, we propose an agent for TBGs that induces a graph representation of the game state and jointly grounds it with a graph of commonsense knowledge from ConceptNet. This combination is achieved through bidirectional knowledge graph attention between the two symbolic representations. We show that agents that incorporate commonsense into the game state graph outperform baseline agents.


Introduction
Text-based games (TBGs) are simulation environments in which an agent interacts with the world purely in the modality of text. TBGs have emerged as key benchmarks for studying how reinforcement learning agents can tackle the challenges of language understanding, partial observability, and action generation in combinatorially large action spaces. One particular text-based gaming environment, TextWorld (Côté et al., 2018), has received significant attention in recent years.
Figure 1: An illustration of a TBG that requires both the state representation of the game and external commonsense knowledge for efficient exploration and for learning the best action trajectory. The observation text feeds into the state and commonsense graphs, and the best action trajectory is computed based on information from both graphs.

Recent work has shown the need for additional knowledge to tackle the challenges in TBGs. Ammanabrolu and Riedl (2019) proposed handcrafted rules to represent the current state of the game using a state knowledge graph (much like a map of the game). Our own prior work (Murugesan et al., 2021) proposed an extension of TextWorld, called TextWorld Commonsense (TWC), to test agents' ability to use commonsense knowledge while interacting with the world. The hypothesis behind TWC
is that commonsense knowledge allows the agent to understand how current actions might affect future world states and enables look-ahead planning (Juba, 2016), thus leading to sample-efficient selection of actions at each step and driving the agent closer to optimal performance.
In this paper, we posit that to act efficiently in such text-based gaming environments, an agent must be able to effectively track the state of the game, and use that state to jointly retrieve and leverage the relevant commonsense knowledge. For example, commonsense knowledge such as an apple should be placed in the refrigerator would help the agent act closer to the optimal behavior, whereas state information like the apple is on the table would help the agent plan more efficiently. Thus, we propose a technique to: (a) track the state of the game in the form of a symbolic graph that represents the agent's current belief about the state of the world (Ammanabrolu and Hausknecht, 2020a; Adhikari et al., 2020); (b) retrieve the relevant commonsense knowledge from ConceptNet (Speer et al., 2017); and (c) jointly leverage the state graph and the retrieved commonsense graph. This combined information is then used to select the optimal action. Finally, we demonstrate the performance of our agent against state-of-the-art baseline agents on the TWC environment.

Related Work
Text-based reinforcement learning Text-based games have recently emerged as a promising framework for driving advances in RL research. Prior work has explored text-based RL to learn strategies based on an external text corpus (Branavan et al., 2012) or on textual observations (Narasimhan et al., 2015). In both cases, the text is analyzed and control strategies are learned jointly using feedback from the gaming environment. Zahavy et al. (2018) proposed the Action-Elimination Deep Q-Network (AE-DQN), which learns to classify invalid actions in order to reduce the action space. The use of the commonsense and state graphs in our work has the same goal of down-weighting implausible actions, by jointly reasoning over the state of the game and prior knowledge. Recently, Côté et al. (2018) introduced TextWorld, and Murugesan et al. (2021) proposed TextWorld Commonsense (TWC), a text-based gaming environment that requires agents to leverage prior knowledge in order to solve the games. In this work, we build on the agents of Murugesan et al. (2021) and show that prior knowledge and state information are complementary and should be learned jointly.
KG-based state representations A recent line of work in TBGs aims at enhancing generalization performance by using symbolic representations of the agent's belief. Notably, Ammanabrolu and Riedl (2019) proposed KG-DQN and Ammanabrolu and Hausknecht (2020b) proposed KG-A2C. The idea behind both approaches is to represent the game state as a belief graph. Recently, Adhikari et al. (2020) proposed the graph-aided transformer agent (GATA), an approach to construct and update a latent belief graph during planning. Our work integrates these graph-based state representations with a prior commonsense graph that allows the agent to better model the state of the game using prior knowledge.
Sample-efficient reinforcement learning A key challenge for current RL research is low sample efficiency (Kaelbling et al., 1998). To address this problem, there have been a few attempts at adding prior or external knowledge to RL approaches. Notably, Murugesan et al. (2020) proposed to use prior knowledge extracted from ConceptNet. Garnelo et al. (2016) proposed Deep Symbolic RL, which relies on techniques from symbolic AI as a way to introduce commonsense priors. There has also been work on policy transfer (Bianchi et al., 2015), which aims at reusing knowledge gained in different environments. Moreover, experience replay (Wang et al., 2016; Lin, 1992, 1993) provides a framework for how previous experiences can be stored and later reused. In this paper, following Murugesan et al. (2020), we use external KGs as a source of prior knowledge, and we combine this knowledge representation with graph-based state modeling in order to allow agents to act more efficiently.

Model & Architecture
TBGs can be framed as partially observable Markov decision processes (POMDPs) (Spaan, 2012), denoted ⟨S, A, O, T, E, r⟩, where: S denotes the set of states, A the action space, O the observation space, T the state transition probabilities, E the conditional observation emission probabilities, and r : S × A → R the reward function. The observation o_t at time step t depends on the current state. Both observations and actions are rendered in text. The agent receives a reward at every time step t, r_t = r(o_t, a_t), and its goal is to maximize the expected discounted sum of rewards E[Σ_t γ^t r_t], where γ ∈ [0, 1) is the discount factor.

The high-level architecture of our model contains three major components: (a) the input encoder; (b) a graph-based knowledge extractor; and (c) the action prediction module. The input encoding layers are used to encode the observation o_t at time step t and the list of admissible actions using GRUs (Ammanabrolu and Hausknecht, 2020a). The graph-based knowledge extractor collects relevant knowledge from complementary knowledge sources: the game state and external commonsense knowledge. We allow information from each knowledge source to guide and direct better representation learning for the other.
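As a small illustration of the objective, the discounted return of a single trajectory of per-step rewards r_t can be computed as follows (γ = 0.99 is an illustrative choice, not a setting reported in this paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards for one episode trajectory,
    accumulated backwards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Toy trajectory of per-step rewards r_t = r(o_t, a_t):
# the only reward arrives at the final step.
rewards = [0.0, 0.0, 1.0]
print(round(discounted_return(rewards), 4))  # 0.99^2 * 1.0 = 0.9801
```

Maximizing this quantity in expectation over trajectories is the agent's learning objective.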
Recent efforts have demonstrated the use of primarily two different types of knowledge sources for TextWorld RL agents. A State Graph (SG) captures state information about the environment (Ammanabrolu and Riedl, 2019), represented via a language-based semantic graph; the example in Figure 2 shows state information relating entities such as the Apple, the Plate, the Table, and the agent's Inventory. Ammanabrolu and Riedl (2019) create such knowledge graphs by extracting information using OpenIE (Angeli et al., 2015) and some manual heuristics. A Commonsense Graph (CG) captures external commonsense knowledge between entities (Murugesan et al., 2021), drawn from commonsense knowledge sources such as ConceptNet. We posit that RL agents can make use of information from both these graphs during different sub-tasks, enabling efficient learning. The SG provides the agent with a symbolic way of representing its current perception of the game state, including its understanding of the surroundings. The CG, on the other hand, provides the agent with complementary human-like knowledge about which actions make sense in a given state, thus enabling more efficient exploration of the very large natural language action space. We combine the state information with commonsense knowledge using a Bidirectional Knowledge-graph attEntion (BiKE) mechanism, which recontextualizes the state and commonsense graphs based on each other to find optimal action trajectories. Figure 2 provides a compact visualization.
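As a toy illustration of the two complementary graphs, they can be viewed as sets of (head, relation, tail) triples; all entities and relations below are hypothetical examples, not drawn from the actual games:

```python
# Hypothetical toy versions of the two graphs as (head, relation, tail) triples.
state_graph = [            # SG: what the agent currently believes about the world
    ("apple", "on", "table"),
    ("agent", "in", "kitchen"),
]
commonsense_graph = [      # CG: ConceptNet-style prior knowledge
    ("apple", "AtLocation", "refrigerator"),
    ("dirty_dish", "AtLocation", "dishwasher"),
]

def entities(graph):
    """Collect the node set of a triple-based graph."""
    return {h for h, _, _ in graph} | {t for _, _, t in graph}

# Nodes appearing in both graphs are where bidirectional attention
# can align state information with prior knowledge.
shared = entities(state_graph) & entities(commonsense_graph)
print(sorted(shared))  # ['apple']
```

Even for entities that appear in only one graph, the attention mechanism described below can still exchange information via embedding similarity rather than exact name matches.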

Knowledge Integration using BiKE
The aforementioned graph-based knowledge extractor produces M entities (c_t^1, c_t^2, ..., c_t^M) for the commonsense graph (CG) and N entities (s_t^1, s_t^2, ..., s_t^N) for the state graph (SG). Note that the entities extracted for the CG are based on the vocabulary used in ConceptNet, and may not necessarily coincide with the set of entities in the SG (Figure 1). We embed the extracted entities in both graphs using Numberbatch embeddings (Speer et al., 2017). We then encode these graph representations using a Graph Attention Network (GAT) (Veličković et al., 2018), which allows the node entities s_t and c_t within the graphs G_t^S and G_t^C, respectively, to share information with each other by message passing.
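The message-passing step can be sketched as follows. This is a minimal, single-head stand-in for a GAT layer, with plain dot-product scores replacing GAT's learned linear projections and LeakyReLU attention scoring:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def gat_layer(feats, edges):
    """One simplified graph-attention message-passing step: each node
    aggregates neighbour features weighted by a softmax over
    compatibility scores (toy stand-in for a real GAT layer)."""
    n = len(feats)
    nbrs = {i: {i} for i in range(n)}  # include self-loops
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    out = []
    for i in range(n):
        js = sorted(nbrs[i])
        # Compatibility of node i with each neighbour (dot product here).
        alphas = softmax([sum(a * b for a, b in zip(feats[i], feats[j]))
                          for j in js])
        dim = len(feats[i])
        out.append([sum(al * feats[j][d] for al, j in zip(alphas, js))
                    for d in range(dim)])
    return out

# Two-node toy graph: node 0 attends mostly to itself (score 1.0 vs 0.0).
feats = [[1.0, 0.0], [0.0, 1.0]]
updated = gat_layer(feats, edges=[(0, 1)])
```

After a few such steps, each node's representation summarizes its local neighbourhood, which is what the subsequent bidirectional attention operates on.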
We then integrate the sub-graphs extracted in the previous steps to improve the agent's exploration strategy. Inspired by the bidirectional attention mechanism used in QA (Seo et al., 2016), we apply the BiKE attention mechanism between G_t^S and G_t^C to fuse the knowledge from these two graphs. The information flow across the graphs allows the model to learn commonsense-aware state graph representations and state-aware commonsense graph representations.
To implement this, we compute a graph similarity matrix S ∈ R^{N×M} across the graph entities to learn a state-to-commonsense graph attention function and a commonsense-to-state graph attention function. The entry S_ij = f(s_t^i, c_t^j) captures how each node s_t^i in the graph G_t^S is linked to a node c_t^j in the other graph G_t^C, and vice versa; here f is a learnable function that maps s_t^i and c_t^j to a similarity score. This allows us to measure the similarity between (for instance) the Apple observed in the state graph and the Apple present in the commonsense graph. We compute the state-to-commonsense graph attention values A by taking a softmax along the rows of S: this signifies the attention bestowed by each state graph node on the nodes of the commonsense graph. Similarly, we compute the commonsense-to-state graph attention values Ā by taking a softmax along the columns of S. We capture the relevant knowledge in the commonsense graph G_t^C by computing, for each state node, an attended commonsense vector s̃_t^i = Σ_j A_ij c_t^j, and the updated state representation s_{t+1}^i = g([s_t^i; s̃_t^i]), where g is a learnable function that maps the concatenation of s_t^i and s̃_t^i to an updated state representation. Finally, we use general attention (Luong et al., 2015) between o_t and the state graph entities s_{t+1} to get the state graph representation g_{t+1}^S. We perform a similar process for the commonsense-to-state graph attention and obtain the commonsense graph representation g_{t+1}^C. We select the relevant action by computing an attention over the actions, with score α_t^i = h([o_t; a_t^i; g_{t+1}^S; g_{t+1}^C]), where h is a learnable function that projects the concatenation of o_t, a_t^i, g_{t+1}^S, and g_{t+1}^C to the attention score for the i-th action.
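A minimal sketch of the bidirectional attention computation, with a plain dot product standing in for the learnable similarity function f and the learnable update g omitted:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def bike_attention(state_nodes, cg_nodes):
    """Toy sketch of the bidirectional graph attention:
    S[i][j]  = f(s_i, c_j)      (dot product stands in for learnable f)
    A        = row-softmax(S)    state-to-commonsense attention
    A_bar    = column-softmax(S) commonsense-to-state attention
    s_tilde_i = sum_j A[i][j] * c_j, the commonsense knowledge
    attended by state node i."""
    S = [[sum(a * b for a, b in zip(s, c)) for c in cg_nodes]
         for s in state_nodes]
    A = [softmax(row) for row in S]
    # Softmax down each column of S, then transpose back.
    A_bar_T = [softmax(list(col)) for col in zip(*S)]
    A_bar = [list(row) for row in zip(*A_bar_T)]
    dim = len(cg_nodes[0])
    s_tilde = [[sum(A[i][j] * cg_nodes[j][d] for j in range(len(cg_nodes)))
                for d in range(dim)]
               for i in range(len(state_nodes))]
    return A, A_bar, s_tilde

# One state node aligned with the first of two commonsense nodes.
A, A_bar, s_tilde = bike_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In the full model, the attended vectors would then pass through the learnable functions g and h described above to produce the updated state representations and the per-action attention scores.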

Experiments
We generate a set of games with 3 difficulty levels using the TWC (Murugesan et al., 2021) framework: (i) easy level, which has 1 room containing 1 to 3 objects; (ii) medium level, which has 1 or 2 rooms with 4 or 5 objects; and (iii) hard level, a mix of games with a high number of objects (6 or 7 objects in 1 or 2 rooms) or a high number of rooms (3 or 4 rooms containing 4 or 5 objects). We compare 5 text-based RL agents: (a) a text-only agent (Text), which selects the best action based only on the encoding of the history of observations; (b) DRRN (He et al., 2016; Narasimhan et al., 2015), which relies on the relevance between the observation and action spaces; (c) an agent enhanced with access to an external commonsense knowledge graph (+Commonsense) (Murugesan et al., 2021); (d) an agent that, following Ammanabrolu and Hausknecht (2020a), models the state of the world as a symbolic graph (+State); and (e) the agent (BiKE) described in Section 3, which relies on both state and commonsense graph representations. The agents are trained over 100 episodes with a 50-step maximum. All policies are learned using Actor-Critic.
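For concreteness, the level setup described above can be summarized as follows; this is a hypothetical encoding for illustration, not the TWC framework's actual configuration API:

```python
# Hypothetical encoding of the three TWC difficulty levels described above.
DIFFICULTY = {
    "easy":   {"rooms": [1],    "objects": range(1, 4)},  # 1 room, 1-3 objects
    "medium": {"rooms": [1, 2], "objects": range(4, 6)},  # 1-2 rooms, 4-5 objects
    "hard": [  # mix of two game types
        {"rooms": [1, 2], "objects": range(6, 8)},        # many objects
        {"rooms": [3, 4], "objects": range(4, 6)},        # many rooms
    ],
}

# Training budget used for all agents.
MAX_EPISODES, MAX_STEPS = 100, 50
```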

Improving Performance with State and Commonsense Knowledge

Figure 3 shows the learning curves for the text-only agent and the agents equipped with state and/or commonsense graph representations at training time. For reference, we also report the performance of an agent that selects a random action at each time step (Random). We notice that, overall, agents equipped with either state or commonsense graph representations perform better than their text-only counterparts, both in terms of the number of steps taken and the normalized score. In particular, the BiKE agent outperforms all other agents at all difficulty levels, showing that symbolic state representations and prior commonsense knowledge can be jointly used for better sample efficiency and results.

Table 1: Test-set performance results for within-distribution (IN) and out-of-distribution (OUT) games.
Figure 4: Analysis of the relevance given to the state and commonsense graphs (a) and to their nodes (b) by action taken. Panel (a) shows the average relevance of the main action templates to the state and commonsense graphs across the hard games; panel (b) shows an example of the most relevant graphs and nodes, by action taken, in an excerpt of a game in the hard difficulty level.

Table 1 shows the performance of the agents on the test set. Following Murugesan et al. (2021), we compared our agents on two test sets: (IN) uses the same entities as the training set, and (OUT) uses entities that were not included in the training set. From Figure 3 and Table 1, we notice that the +Commonsense agent performs better on the easy level, whereas the +State agent performs better on the medium and hard levels. This suggests that the state representation can be leveraged to drive exploration and interaction with objects in environments with multiple rooms, whereas prior commonsense knowledge allows the agent to act more efficiently by selecting the appropriate commonsensical locations for different objects.
In order to investigate this hypothesis, we computed the average importance given by the agent to the state graph and the commonsense graph when selecting the different action templates shown in Figure 4a. For each action template, the figure shows the normalized attention weight given to the two graphs, averaged across 5 runs of all games in the hard difficulty level. We notice that actions requiring information about the goal of the game, like the put and insert actions, benefit more from attending to the commonsense graph; whereas actions aimed at exploring the environment and collecting objects, like the go and take actions, benefit more from the state representation.

Qualitative Analysis
As a further qualitative analysis, we report in Figure 4b an example of the most attended nodes and graphs in an excerpt of a game belonging to the hard difficulty level. As noted above, the take and go actions rely more on the state graph, whereas the insert action relies on the commonsense graph. Among the nodes in these graphs, the entities that are ultimately mentioned in the chosen action receive the highest attention score, showing how our agent transfers the bidirectional attention over graphs into specific game instances.

Conclusion
We hypothesized that, in order to be sample-efficient in text-based games, agents must be able to jointly track the state of the game and retrieve the relevant commonsense knowledge. We proposed a technique that models both forms of knowledge as graphs and combines them using a novel bidirectional graph co-attention mechanism (BiKE). Our experimental results show that the resulting agent is more sample-efficient, and generalizes better, than approaches that consider neither or only one of these graphs, across the 3 difficulty levels.