Revisiting the Roles of “Text” in Text Games



Introduction
In text-based games (Narasimhan et al., 2015; He et al., 2016; Hausknecht et al., 2019; Côté et al., 2018), players read text observations, issue text actions to interact with a simulated world, and gain rewards as they progress through the story. From a reinforcement learning (RL) viewpoint, these games are partially observable Markov decision processes (POMDPs): the current observation does not carry the full information of the game progress. In our Figure 1 example, visiting the Living Room before or after the dark place puzzle may yield the same observation, but only when informed by the game history can the player decide whether "go down" is the right action.

* Equal contribution.

Figure 1 excerpt (Zork I):
Observation: Kitchen. You are in the kitchen of the white house… There is a brass lantern (battery-powered) here… Look: <Same as Observation> Inventory: A glass bottle containing water.
Observation: You have moved into a dark place. The trap door crashes shut, and you hear someone barring it. It is pitch black. You are likely to be eaten by a grue. Look: It is pitch black. You are likely to be eaten by a grue. Inventory: A glass bottle containing water. A brown sack.
Observation: Living Room. Above the trophy case hangs an elvish sword of great antiquity. Look: Living Room. There is a doorway to the east, …, and a rug lying beside an open trap door. Inventory: A glass bottle containing water. A brown sack.
Recent work has proposed to equip RL agents with natural language understanding (NLU) capabilities for better text game performance. For example, pre-trained language models support combinatorial action generation (Yao et al., 2020); commonsense reasoning (Murugesan et al., 2021), information extraction (Ammanabrolu and Hausknecht, 2020), and reading comprehension (Guo et al., 2020) provide priors for exploration under sparse rewards and long horizons; and knowledge graph (Ammanabrolu and Hausknecht, 2020) and document retrieval (Guo et al., 2020) techniques help alleviate partial observability.
Nevertheless, Yao et al. (2021) question the need for NLU in RL agents trained and evaluated on the same game. They found that a text game agent, DRRN (He et al., 2016), performs even slightly better when its RNN-based language representations are replaced with non-semantic hash codes. Intuitively, hashing serves to memorize state-action pairs while ignoring text similarities, which is sometimes useful: consider the second-to-last observation in Figure 1 and a counterfactual observation where a "lantern" is added to "Inventory". RNNs might encode them very similarly even though they lead to antipodal consequences (dying versus exploring the underground). How do we reconcile this with the improved performance of recent NLU-augmented text agents? Where are semantic representations useful, and where would a hash approach suffice?
In this paper, we present initial findings that semantic and non-semantic language representations can work better hand-in-hand than either alone, by targeting different RL challenges. Concretely, we show the hash idea can help DRRN tackle partial observability. Returning to the Figure 1 example: to get the lantern and avoid death, it is vital to know where the lantern is, which is revealed in a previous rather than the current observation. Based on this intuition, we propose a simple algorithm that tracks the current location and the up-to-date descriptions of all locations, then encodes them into a single approximate state hash vector as extra DRRN input. Though lightweight and easy to implement, this representation plug-in improves DRRN scores by 29% across games, with competitive performance against state-of-the-art text agents that use advanced NLU techniques and pre-trained Transformer models. The effectiveness is further confirmed by comparing to models that plug in ground-truth state or location hash codes: we find very little gap between our performance and these upper bounds. These results suggest that the current partial observability bottlenecks might not require advanced NLU models or semantic representations to conquer. However, this message is tempered by ablations showing that the approximate state hash alone achieves only 58% of the full performance, as it fails to handle other RL challenges such as the combinatorial state and action spaces. In conclusion, we find the role of NLU in text games is not black-or-white as indicated by prior work, but rather differs across RL challenges, and agents can benefit from combining semantic and non-semantic language representations that target different functionalities. Our results and insights contribute to future research in designing better tasks and models toward autonomous agents with grounded language abilities.

Problem Formulation
A text game can be formulated as a partially observable Markov decision process (POMDP) ⟨S, A, T, O, Ω, R, γ⟩, where at the t-th turn the agent reads a textual observation o_t = Ω(s_t) ∈ O as a partial reflection of the underlying world state s_t ∈ S, issues a textual command a_t ∈ A in response, and receives a sparse scalar reward r_t = R(s_t, a_t) reflecting game progress. The state transition s_{t+1} = T(s_t, a_t) is hidden from the agent. The goal is to maximize the expected cumulative discounted reward E[Σ_t γ^t r_t].
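The objective above is just a discounted sum over one episode, which a short sketch makes concrete (the reward values below are illustrative only):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over one episode, with t starting at 0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A toy sparse-reward episode: the agent scores 10 points at step 3
# and 25 points at step 5, as text games typically reward progress.
episode_rewards = [0, 0, 0, 10, 0, 25]
ret = discounted_return(episode_rewards, gamma=0.9)
```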
Observations and States. Following prior practice in the Jericho benchmark (Hausknecht et al., 2019), we augment the direct observation o_t with the inventory i_t and location description l_t, obtained by issuing the actions "inventory" and "look" respectively. But even this may not reveal the complete s_t (Section 1), which in Jericho includes an object tree and a large simulator RAM array hidden from players. As s_t is large and lacks interpretability, the state hash h(s_t) is more often used, where h : S → N maps each state to an integer that can be used to probe whether two states are identical, but provides no semantic information about how states differ. Access to s_t or h(s_t) is a handicap in Jericho.
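A minimal sketch of the identity-only property of a state hash, using a hypothetical string serialization as the toy state (Jericho's actual hash is computed over the emulator state, not over text; md5 is used here for run-to-run stability):

```python
import hashlib

def state_hash(state_repr: str) -> int:
    """Map a serialized state to an integer: equal states get equal hashes,
    but the integer reveals nothing about *how* two states differ."""
    return int(hashlib.md5(state_repr.encode()).hexdigest(), 16)

# Hypothetical serialized states; s_a and s_b are identical, s_c differs
# only in that a lantern was picked up.
s_a = "loc=Kitchen|inv=bottle|trapdoor=closed"
s_b = "loc=Kitchen|inv=bottle|trapdoor=closed"
s_c = "loc=Kitchen|inv=bottle,lantern|trapdoor=closed"
```

Two hashes agreeing tells the agent it has revisited a state; two hashes disagreeing says nothing about whether the states are nearly identical or completely different.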

The DRRN Baseline and its Hash Variant
We denote c_t = (o_1, a_1, …, a_{t−1}, o_t) as the game context up to o_t, and for convenience we omit the subscript t when no confusion arises. Our baseline RL model, the Deep Reinforcement Relevance Network (DRRN) (He et al., 2016), learns a Q-network

Q(c_t, a_t) = MLP(sr, ar),  (1)

where the state and action representations

sr = GRU_o(o_t, i_t, l_t),  ar = GRU_a(a_t),  (2)

are encoded by gated recurrent units (GRU) (Cho et al., 2014). The temporal difference (TD) loss and Boltzmann exploration are used for RL.
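The Q-network of Eq. (1) is just an MLP over the concatenated state and action encodings. A minimal NumPy sketch, with random vectors standing in for the learned GRU outputs (shapes and sizes here are illustrative, not the paper's actual hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding size for this sketch

# Stand-ins for the GRU outputs: in DRRN these are learned encodings of
# the observation text (sr) and a candidate action text (ar).
sr = rng.standard_normal(D)
ar = rng.standard_normal(D)

# One-hidden-layer MLP head computing Q(c, a) = MLP(sr, ar).
W1 = rng.standard_normal((2 * D, 16)) * 0.1
W2 = rng.standard_normal((16, 1)) * 0.1

def q_value(sr, ar):
    h = np.maximum(np.concatenate([sr, ar]) @ W1, 0.0)  # ReLU hidden layer
    return float(h @ W2)                                # scalar Q-value

q = q_value(sr, ar)
```

At play time, the agent scores every valid action against the current context this way and samples via a Boltzmann distribution over the Q-values.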
In Yao et al. (2021), Eq. 2 is replaced by random, fixed, non-semantic hash representations, where a hash vector function H = vec ∘ h first maps inputs to integers (via Python's built-in hash) and then to random normal vectors (using the integer as the generator seed), giving

sr = (H(o_t), H(i_t), H(l_t)),  ar = H(a_t).  (3)

However, neither model addresses partial observability by using the context c_t beyond the current observation o_t.
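The hash-to-vector idea can be sketched in a few lines. One caveat: Python's built-in hash is salted per process by default (PYTHONHASHSEED), so this sketch substitutes a stable md5-based hash; the resulting behavior — identical text maps to an identical fixed vector, any change maps to an unrelated vector — is the point:

```python
import hashlib
import numpy as np

def hash_to_vec(text: str, dim: int = 64) -> np.ndarray:
    """H = vec . h: hash text to an integer, then use that integer to seed
    a fixed random normal vector. Non-semantic: similar texts get
    completely unrelated vectors."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(dim)

v_same1 = hash_to_vec("You are in the kitchen.")
v_same2 = hash_to_vec("You are in the kitchen.")
v_other = hash_to_vec("You are in the kitchen. A lantern is here.")
```

This is exactly the memorization-over-similarity trade discussed in the introduction: the lantern-vs-no-lantern observations, nearly identical as text, receive entirely different representations.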

Method
The key to handling partial observability is to extract the appropriate state-distinguishing information from the context c_t: while under-extraction leads to different states with the same representation, over-extraction leads to diverging representations for the same state reached via different history paths. To approximate the state hash, we first obtain and maintain a location map by exploration with limited depth d, collecting the names of adjacent rooms:

po_1(c_t) = {(p, loc(p | c_t)) : |p| ≤ d},  (4)

where p is a sequence of navigation actions and loc(p | c_t) is the location reached by following p from c_t. Essentially, po_1(c_t) serves to distinguish different locations with the same name. Secondly, we collect the most recent location description for every visited location,

last(c_t) = {(loc, most recent "look" of loc) : loc visited in c_t},  (5)

so that we may know, for example, the whereabouts of the lantern when needed (Section 1).
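The depth-limited neighbor collection can be sketched as follows. The environment interface here (`room_name`, reversible `step`/`undo`) is hypothetical — a real Jericho agent would save and restore emulator state instead — and the toy world deliberately contains two rooms that share a name, mimicking the identically named maze rooms in Zork I:

```python
NAV_ACTIONS = ["north", "south", "east", "west", "up", "down"]

class ToyEnv:
    """Minimal world whose rooms expose only a name; same-name rooms can
    be told apart only by their neighbors."""
    def __init__(self, graph, start):
        self.graph, self.here, self.trail = graph, start, []
    def room_name(self):
        return self.graph[self.here]["name"]
    def step(self, action):          # returns True iff the move succeeded
        nxt = self.graph[self.here].get(action)
        if nxt is None:
            return False
        self.trail.append(self.here)
        self.here = nxt
        return True
    def undo(self):                  # restore the pre-move position
        self.here = self.trail.pop()

def explore_neighbors(env, depth=1):
    """Hedged sketch of po_1(c_t): map each navigation path of length
    <= depth to the room name it reaches, via depth-first search with
    backtracking."""
    pairs = {(): env.room_name()}
    def dfs(path):
        if len(path) >= depth:
            return
        for a in NAV_ACTIONS:
            if env.step(a):
                pairs[path + (a,)] = env.room_name()
                dfs(path + (a,))
                env.undo()
    dfs(())
    return pairs

world = {
    "m1": {"name": "Maze", "north": "k"},
    "m2": {"name": "Maze", "north": "c"},
    "k": {"name": "Kitchen"},
    "c": {"name": "Cellar"},
}
po1_m1 = explore_neighbors(ToyEnv(world, "m1"))
po1_m2 = explore_neighbors(ToyEnv(world, "m2"))
```

Both starting rooms report the name "Maze", but their depth-1 neighbor maps differ, so the two locations get distinct fingerprints.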
Together, our model DRRN-LocationGraph (LOG) takes the state representation

sr_log = (sr_drrn, H(po_1(c_t), last(c_t))),  (6)

i.e., the DRRN representation augmented with a single hash vector of the approximate state. The algorithm details are in Appendix A.
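The plug-in nature of the augmentation can be sketched as a simple concatenation; the vector sizes and the string serialization of the approximate state below are illustrative assumptions, not the paper's actual choices:

```python
import hashlib
import numpy as np

def hash_to_vec(text, dim=16):
    """Fixed random vector seeded by a stable hash of the input text."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(dim)

# Semantic part: stand-in for the GRU encoding of (o_t, i_t, l_t).
sr_drrn = np.zeros(32)

# Non-semantic part: one hash vector over the whole approximate state,
# i.e. the local location map plus the latest "look" of each visited room.
approx_state = str((
    {(): "Maze", ("north",): "Kitchen"},                            # po_1(c_t)
    {"Kitchen": "a brass lantern is here", "Maze": "pitch black"},  # last looks
))
sr_log = np.concatenate([sr_drrn, hash_to_vec(approx_state)])
```

Any change anywhere in the tracked map or descriptions flips the whole hash vector, while the semantic half of the representation continues to generalize over similar text.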

Experiments
Implementation Details. We adopt the DRRN hyperparameters from Yao et al. (2021) to train our model. Following previous work, we implement the BiDAF (Seo et al., 2016) attention mechanism and the inverse dynamics auxiliary objective (Yao et al., 2021) for better text encoding. The episode limit is 100 steps, and training runs for 1,000 episodes over 8 parallel game environments.
For po_1, we use d = 1 as the depth limit. We train three independent runs for each game. More details are in Appendix B.
Baselines. Our approach builds on the backbone DRRN agent, so we provide fair comparisons to the original DRRN and its hash and inverse dynamics variants from Yao et al. (2021). We also compare with more complex state-of-the-art agents that are designed to deal with partial observability via NLU:
• MPRC-DQN (Guo et al., 2020), which retrieves the relevant history to enhance the current observation and formulates action prediction as a multi-passage reading comprehension problem.
• KG-A2C (Ammanabrolu and Hausknecht, 2020; Ammanabrolu et al., 2020), which extracts an object graph with OpenIE (Angeli et al., 2015) or a BERT-based QA model (Devlin et al., 2019) and embeds the graph into a single vector as the state representation. We compare with the better result of the two papers for each game.
Evaluated Games. We select 6 games from Jericho (Hausknecht et al., 2019) on which MPRC-DQN or KG-A2C exhibits performance boosts, and which are thus more likely to suffer from partial observability.

Game Results
Table 1 shows game scores for all models. Among DRRN and its variants, DRRN-LOG performs best on 4 of the 6 games. More impressively, our agent is competitive against MPRC-DQN (better or equal score on 3/6 games) and KG-A2C (better scores on 4/6 games) in per-game comparisons. Overall, DRRN-LOG achieves the second-best average normalized score of 36%, only behind the 41% of MPRC-DQN (which is largely attributable to Zork III). Considering that we explicitly chose the six games in favor of these two state-of-the-art baselines, this result indicates that advanced NLU techniques might not be a must for solving partial observability, at least in the scoring ranges of current text game agents (i.e., average normalized scores below 50%).

Oracle Analysis with Groundtruth States
Next, we study the performance gap between our model and an oracle version that replaces our approximate state hash with the ground-truth state hash (GT-State) from Jericho. GT-State can perfectly tell different states apart, with sr_gt = (sr_drrn, H(s)). As shown in Table 2, the scores of DRRN-LOG and GT-State are very close across games, meaning our approximation is close to perfect within the state hashing scheme. Notably, even GT-State fails to consistently surpass MPRC-DQN or KG-A2C, suggesting NLU techniques might help these agents with RL challenges other than partial observability. Finally, we also show in Appendix C the performance of replacing our state approximation with ground-truth room IDs (GT-Room), where our agent achieves on-par or better results on all games. This confirms that our state approximation not only effectively identifies player locations via Eq. 4, but also brings richer state information thanks to Eq. 5.

Ablation Studies
The results in Table 2 also show a huge performance drop for DRRN-LOG and its GT-State version when text encoders are removed; in other words, learning the text game as a tabular MDP without language semantics leads to much worse sample complexity, even when partial observability is solved. To explain why DRRN with the GT-State hash is much worse than DRRN with the observation hash proposed in Yao et al. (2021), note that Eq. 3 still leverages the compositional structure of (o_t, i_t, l_t): e.g., two states with the same i_t still share part of the state representation. This result helps confirm the importance of language for the RL challenge of large observation and action spaces: semantics-preserving function approximation (e.g., an RNN instead of a hash) can be key to interpolation (smooth value estimation for similar states) as well as extrapolation (efficient exploration based on language and commonsense priors).

Finally, we ablate individual components of DRRN-LOG on Zork I. Figure 2 shows that removing the language-learning auxiliary task of inverse dynamics (w/o invdy) or the language attention (w/o att) leads to worse scores, reconfirming that semantic language representations are vital for DRRN-LOG's success. On the other hand, removing the current whereabouts (w/o cur_room) hurts performance much more than removing location descriptions across the map (w/o last_look), suggesting that location identification (Eq. 4) might be more important for solving partial observability.

Discussion

We hope future work probes the necessity of NLU along different dimensions, which would in turn help identify flaws of current setups and propose better ones. We also hope our idea of combining semantic and non-semantic language representations proves useful for building next-generation text game agents, as well as for other language applications with memorization needs such as closed-domain QA or goal-oriented dialog.

Limitations
Our approach to retrieving global state focuses on distinguishing locations. The simplicity of our method helps demonstrate the value of incorporating non-semantic representations in text-based games. Moreover, our hash-based non-semantic representation abstracts away the differences between global state retrieval methods, as long as they successfully distinguish different states. However, we acknowledge that more detailed design is needed to generalize our method to other text-based games.
Another limitation is that our method focuses on interactive fiction, a specific type of text-based game. Most games of this type have many locations to explore, so our location-based approach can successfully distinguish different states. Although the direct applicability of our approach is limited, we believe the idea of combining semantic and non-semantic representations can help in other text-based games and NLP tasks.

A Algorithm of Our Approximate State Representation Construction
Algorithm 1: Infer the current location together with nearby room names using depth-first search with limited depth. This helps distinguish different rooms with the same name in most cases. We use depth d = 1 in our runs.

B More Details of Our Model and Implementation
We follow the hyperparameters of Yao et al. (2021). For the state approximation, we use Python's built-in hash function. We train our model for 10^5 steps, which takes about 40 hours on a TITAN X or GeForce GTX 1080.
We use the latest Jericho version 3.1.0. Due to a bug in Zork I, we add a timeout in the library to filter out valid actions that cause the emulator to hang.

B.1 Details of Our BiDAF Observation Encoder
In DRRN, the GRU takes responsibility both for memorizing high-scoring trajectories and for generalizing to unseen observations. In our method, the memorization power is provided by the hash codes of local graphs, which have a stronger ability to distinguish states. We therefore aim to strengthen the generalization ability of the neural network, and propose an attentive extension of the observation embedding.
Our key idea is based on the insight that the Q-value in DRRN is computed by matching textual observations to a textual action. Since observations are usually significantly longer than actions, the effect of an action can usually be determined by its interaction with a local context in the observation. This can be naturally modeled with an attention mechanism. Specifically, we apply BiDAF (Seo et al., 2016) to match each observation component to the action.
BiDAF takes the observation and action embeddings and outputs an action-attended observation embedding. We denote the GRU embeddings of observation word i and action word j as o_i and a_j. The attention score from an observation word to an action word is α_ij = exp(a_ij) / Σ_{j′} exp(a_{ij′}), where a_ij = o_i^T a_j. We then compute the "action-to-observation" summary vector for the i-th observation word as c_i = Σ_j α_ij a_j. We concatenate the output vectors as [o_i; c_i; o_i ⊙ c_i; |o_i − c_i|], followed by a linear layer with leaky ReLU activation. We apply the same steps to the inventory i_t and location description l_t, too. Finally, we obtain the action-attended representations that replace the plain GRU observation encodings.
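The attention computation above can be sketched in NumPy (this sketch stops at the concatenated features and omits the final linear layer with leaky ReLU; the embedding matrices here are random stand-ins for GRU outputs):

```python
import numpy as np

def action_attend(O, A):
    """Action-to-observation attention. O: (n_obs, d) observation word
    embeddings; A: (n_act, d) action word embeddings. Returns per-word
    features [o_i; c_i; o_i * c_i; |o_i - c_i|] of width 4d."""
    scores = O @ A.T                                     # a_ij = o_i^T a_j
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)            # softmax over action words j
    C = alpha @ A                                        # c_i = sum_j alpha_ij a_j
    return np.concatenate([O, C, O * C, np.abs(O - C)], axis=1)

rng = np.random.default_rng(0)
feats = action_attend(rng.standard_normal((5, 4)), rng.standard_normal((2, 4)))
```

Each observation word thus carries a summary of the action words it attends to, letting the Q-network focus on the local observation context an action interacts with.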

C Additional Experiments with Oracle State Information
We investigate the performance of replacing our state approximation with ground-truth room IDs (GT-Room). Specifically, instead of using our state representation or the ground-truth state ID, we use the ground-truth ID of the room the player is located in. The room ID has a much lower dimensionality than the state ID and is a simple yet strong indicator for game playing. However, it is only a shortcut for some game states; to fundamentally deal with partial observability, the state representation should capture information beyond room IDs.
The experiment here is designed to verify whether our proposed representation learns more than room IDs. Intuitively, if a model only learned to mimic room IDs, its performance would be strictly below the oracle GT-Room results. Our results in Table 3 show that our model is on par in terms of the average measure and even higher in terms of the maximum measure across all games. This confirms that our state approximation not only identifies the true location of the player, but also brings richer state information.
Together with the GT-State results, this demonstrates the advantage of our proposed state representation: it is on par with or better than both the sufficient but sparse state-ID oracle and the less informative but smoother room-ID oracle.

D Zork I Trajectory of DRRN-LOG
Obs0: Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved. ZORK is a registered trademark of Infocom, Inc. Revision 88 / Serial number 840726. West of House. You are standing in an open field west of a white house, with a boarded front door. There is a small mailbox here.
Inv: you are empty handed.
Desc: west of house you are standing in an open field west of a white house, with a boarded front door. there is a small mailbox here.
Action0: north, Score 0

Obs1: North of House. You are facing the north side of a white house. There is no door here, and all the windows are boarded up. To the north a narrow path winds through the trees.
Inv: you are empty handed.
Desc: north of house you are facing the north side of a white house. there is no door here, and all the windows are boarded up. to the north a narrow path winds through the trees.
Action1: east, Score 0

Obs2: Behind House. You are behind the white house. A path leads into the forest to the east. In one corner of the house there is a small window which is slightly ajar.
Inv: you are empty handed.
Desc: behind house you are behind the white house. a path leads into the forest to the east. in one corner of the house there is a small window which is slightly ajar.
Action2: open window, Score 0

Obs3: With great effort, you open the window far enough to allow entry.
Inv: you are empty handed.
Desc: behind house you are behind the white house. a path leads into the forest to the east. in one corner of the house there is a small window which is open.
Action3: west, Score 10

Obs4: Kitchen. You are in the kitchen of the white house. A table seems to have been used recently for the preparation of food. A passage leads to the west and a dark staircase can be seen leading upward. A dark chimney leads down and to the east is a small window which is open. On the table is an elongated brown sack, smelling of hot peppers. A bottle is sitting on the table. The glass bottle contains: A quantity of water.
Inv: you are empty handed.
Desc: kitchen you are in the kitchen of the white house. a table seems to have been used recently for the preparation of food. a passage leads to the west and a dark staircase can be seen leading upward. a dark chimney leads down and to the east is a small window which is open. on the table is an elongated brown sack, smelling of hot peppers. a bottle is sitting on the table. the glass bottle contains: a quantity of water.
Action4: west, Score 10

Obs5: Living Room. You are in the living room. There is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a large oriental rug in the center of the room. Above the trophy case hangs an elvish sword of great antiquity. A battery powered brass lantern is on the trophy case.
Inv: you are empty handed.
Desc: living room you are in the living room. there is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a large oriental rug in the center of the room. above the trophy case hangs an elvish sword of great antiquity. a battery powered brass lantern is on the trophy case.
Action5: push rug, Score 10

Obs6: With a great effort, the rug is moved to one side of the room, revealing the dusty cover of a closed trap door.
Inv: you are empty handed.
Desc: living room you are in the living room. there is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a closed trap door at your feet. above the trophy case hangs an elvish sword of great antiquity. a battery powered brass lantern is on the trophy case.
Action6: take all, Score 10

Obs7: trophy case: The trophy case is securely fastened to the wall. sword: Taken. brass lantern: Taken. carpet: The rug is extremely heavy and cannot be carried.
Inv: you are carrying: a brass lantern, a sword.
Desc: living room you are in the living room. there is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a closed trap door at your feet.
Action7: open trap, Score 10

Obs8: The door reluctantly opens to reveal a rickety staircase descending into darkness.
Inv: you are carrying: a brass lantern, a sword.
Desc: living room you are in the living room. there is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a rug …

Obs12: The Troll Room. This is a small room with passages to the east and south and a forbidding hole leading west. Bloodstains and deep scratches (perhaps made by an axe) mar the walls. A nasty looking troll, brandishing a bloody axe, blocks all passages out of the room. Your sword has begun to glow very brightly.
Inv: you are carrying: a brass lantern (providing light), a sword. the axe crashes against the rock, throwing sparks!
Desc: the troll room this is a small room with passages to the east and south and a forbidding hole leading west. bloodstains and deep scratches (perhaps made by an axe) mar the walls. a nasty looking troll, brandishing a bloody axe, blocks all passages out of the room. the axe crashes against the rock, throwing sparks!
Action12: hit troll with sword, Score 35

Obs13: The haft of your sword knocks out the troll.
Inv: you are carrying: a brass lantern (providing light), a sword.
Desc: the troll room this is a small room with passages to the east and south and a forbidding hole leading west. bloodstains and deep scratches (perhaps made by …

… lantern (providing light)
Desc: maze this is part of a maze of twisty little passages, all alike.
Action26: southeast, Score 45

Obs27: Cyclops Room. This room has an exit on the northwest, and a staircase leading up. A cyclops, who looks prepared to eat horses (much less mere adventurers), blocks the staircase. From his state of health, and the bloodstains on the walls, you gather that he is not very friendly, though he likes people.
Inv: you are carrying: a brass lantern (providing light).
Desc: cyclops room this room has an exit on the northwest, and a staircase leading up. a cyclops, who looks prepared to eat horses (much less mere adventurers), blocks the staircase. from his state of health, and the bloodstains on the walls, you gather that he is not very friendly, though he likes people.
Action27: odysse, Score 45

Obs28: The cyclops, hearing the name of his father's deadly nemesis, flees the room by knocking down the wall on the east of the room.
Inv: you are carrying: a brass lantern (providing light).
Desc: cyclops room this room has an exit on the northwest, and a staircase leading up. the east wall, previously solid, now has a cyclops sized opening in it.
Action28: up, Score 70

Obs29: You hear a scream of anguish as you violate the robber's hideaway. Using passages unknown to you, he rushes to its defense. The thief gestures mysteriously, and the treasures in the room suddenly vanish. Treasure Room. This is a large room, whose east wall is solid granite. A number of discarded bags, which crumble at your touch, are scattered about on the floor. There is an exit down a staircase. There is a suspicious looking individual, holding a large bag, leaning against one wall. He is armed with a deadly stiletto. There is a silver chalice, intricately engraved, here. The thief draws blood, raking his stiletto across your arm.
Inv: you are carrying: a brass lantern (providing light). you dodge as the thief comes in low.
Desc: treasure room this is a large room, whose east wall is solid granite. a number of discarded bags, which crumble at your touch, are scattered about on the floor. there is an exit down a staircase. there is a suspicious looking individual, holding a large bag, leaning against one wall. he is armed with a deadly stiletto. there is a silver chalice, intricately engraved, here. you dodge as the thief comes in low.
Action29: temple, Score 70

Obs30: Temple. This is the north end of a large temple. On the east wall is an ancient inscription, probably a prayer in a long forgotten language. Below the prayer is a staircase leading down. The west wall is solid granite. The exit to the north end of the room is through huge marble pillars. There is a brass bell here.
Inv: you are carrying: a brass lantern (providing light).
Desc: temple this is the north end of a large temple. on the east wall is an ancient inscription, probably a prayer in a long forgotten language. below the prayer is a staircase leading down. the west wall is solid granite. the exit to the north end of the room is through huge marble pillars. there is a brass bell here.
Action30: throw lantern at prayer, Score 70

Figure 1: A game trajectory from Zork I.

Table 1: Final episodic/maximum explored scores for different games. MPRC-DQN numbers with max scores correspond to a version change of the games, so we re-run their model and report the new results. Average normalized score (avg norm) is the model score divided by the maximum game score, averaged across games.

Table 2: The results of replacing our state representations with ground-truth state IDs (GT-State Full Model), as well as removing the text encoder (-Text Enc).

Table 3: The results of replacing our state approximation with ground-truth room IDs (GT-Room).
Opened.
Inv: you are carrying: a brass lantern, a sword.
Desc: living room you are in the living room. there is a doorway to the east, a wooden door with strange gothic lettering to the west, which appears to be nailed shut, a trophy case, and a rug lying beside an open trap door.
Action9: down, Score 35

Obs10: You have moved into a dark place. The trap door crashes shut, and you hear someone barring it. It is pitch black. You are likely to be eaten by a grue. Your sword is glowing with a faint blue glow.
Inv: you are carrying: a brass lantern, a sword.
Desc: it is pitch black. you are likely to be eaten by a grue.
Action10: turn on lantern, Score 35

Obs11: The brass lantern is now on. Cellar. You are in a dark and damp cellar with a narrow passageway leading north, and a crawlway to the south. On the west is the bottom of a steep metal ramp which is unclimbable.
Inv: you are carrying: a brass lantern (providing light), a sword.
Desc: cellar you are in a dark and damp cellar with a narrow passageway leading north, and a crawlway to the south. on the west is the bottom of a steep metal ramp which is unclimbable.
Action11: north, Score 35