Interactive Machine Comprehension with Dynamic Knowledge Graphs

Interactive machine reading comprehension (iMRC) comprises machine comprehension tasks in which knowledge sources are partially observable: an agent must interact with an environment sequentially to gather the knowledge necessary to answer a question. We hypothesize that graph representations are good inductive biases that can serve as an agent's memory mechanism in iMRC tasks. We explore four categories of graphs that capture text information at various levels. We describe methods that dynamically build and update these graphs during information gathering, as well as neural models that encode graph representations in RL agents. Extensive experiments on iSQuAD suggest that graph representations can yield significant performance improvements for RL agents.


Introduction
Machine reading comprehension (MRC) has gathered wide interest from the NLP community in recent years. It serves as a way to benchmark a system's ability to understand and reason over natural language. Typically, given a knowledge source such as a document, a model is required to read through the knowledge source to answer a question about some information contained therein. In the extractive QA paradigm, in particular, answers are typically sub-strings of the knowledge source (Rajpurkar et al., 2016; Trischler et al., 2016a; Yang et al., 2018a). Models are thus required to select a span from the knowledge source as their prediction.
A recent line of work known as interactive machine reading comprehension (iMRC) features interactive language learning and knowledge acquisition (Yuan et al., 2020; Ferguson et al., 2020). It shifts the focus of MRC research towards a more realistic setting where the knowledge sources (environments) are partially observable. Under this setting, agents must iteratively interact with the environment to discover necessary information in order to answer the questions. The sequence of interactions between an agent and the environment may resemble the agent's reasoning path, rendering a higher level of interpretability in the agent's behaviour. The trajectories of interactions can also be seen as procedural knowledge, which potentially brings agents extra generalizability (humans do not necessarily know the answer to a question immediately, but they know the procedure to search for it). Compared to many static MRC datasets, where the entire knowledge source (e.g., a paragraph in SQuAD) is presented to the model immediately, the iMRC setting may alleviate the risk of learning shallow pattern matching (Sugawara et al., 2018; Sen and Saffari, 2020).
On a parallel track, there has been a plethora of studies that leverage graphs in MRC. Multiple linguistic features have been explored to help construct graphs, such as coreference (Dhingra et al., 2018; Song et al., 2018), entities (De Cao et al., 2019; Qiu et al., 2019; Tu et al., 2020), and semantic roles (Zheng and Kordjamshidi, 2020). In related areas such as vision- and text-based games, prior works also attempt to build implicit and explicit graphs to encode data from various types of modalities (Johnson, 2017; Ammanabrolu and Riedl, 2019; Kipf et al., 2020; Adhikari et al., 2020). All of these works, covering domains from static MRC to sequential decision making, suggest that graph representations can facilitate model learning. This gives us a strong motivation to leverage graph representations in the iMRC setting.
We hypothesize that graph representations are good inductive biases, since they can serve naturally as a memory mechanism to help RL agents tackle partial observability in iMRC tasks. We develop an agent that dynamically updates a graph representation with new information at every step and integrates information from both text and graph modalities to make decisions. The main contributions of this work are as follows: 1. We propose four categories of graph representations, each capturing text information from a unique perspective; we demonstrate how to generate and maintain the graphs dynamically in iMRC tasks.
2. We extend the RL agent proposed in (Yuan et al., 2020) by adding a graph encoding mechanism and a recurrent memory mechanism.
3. We conduct extensive experiments and show that the proposed graph representations can greatly boost agents' performance on iMRC tasks.

Problem Setting
We follow the iMRC setting (Yuan et al., 2020), where given an environment consisting of a partially observable document and a question, an agent needs to sequentially interact with the environment to discover necessary information and then answer the question. The iMRC paradigm reformulates existing MRC datasets (e.g., SQuAD) into interactive environments by occluding most parts of their documents. A set of commands is defined to help agents reveal glimpses of the hidden documents. iMRC can be seen as a controllable simulation of a family of complex real-world environments, where knowledge sources are partially observable yet easily accessible by design through interactions. One such example is the Internet, where humans can efficiently navigate through keywords and links to retrieve only the necessary information, rather than reading through the entire collection of websites. While iMRC shares some common properties with multi-step retrieval (Yang et al., 2018a; Zhao et al., 2021) and open-domain QA (Lewis et al., 2020), the focus here is to push the boundaries of information-seeking agents (Bachman et al., 2016) from an RL/navigation perspective.
At game step t, the environment state s_t ∈ S represents the semantics and information contained in the full document, as well as which subset of the sentences has been revealed to the agent. The agent perceives text information as its observation, o_t ∈ Ω, which depends on the environment state with probability O(o_t | s_t). The agent issues an action a_t ∈ A, resulting in a transition to state s_{t+1} with probability T(s_{t+1} | s_t, a_t) in the environment (i.e., a new sentence is shown to the agent). Based on its actions, the agent receives rewards r_t = R(s_t, a_t). The agent's objective is to maximize the expected discounted sum of rewards E[∑_t γ^t r_t], where γ ∈ [0, 1] is the discount factor.
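To make this formulation concrete, the loop below sketches the agent-environment interaction and the discounted return. The ToyEnv class, its sentences, and the sparse answer-based reward are hypothetical stand-ins for the real iSQuAD environment, not the authors' implementation.

```python
# Minimal sketch of the iMRC interaction loop (assumed, simplified):
# the document is revealed one sentence at a time, and the agent
# maximizes the discounted return E[sum_t gamma^t r_t].

class ToyEnv:
    """A document whose sentences are revealed one at a time."""
    def __init__(self, sentences, answer):
        self.sentences = sentences
        self.answer = answer
        self.cursor = 0

    def observe(self):
        # o_t: only the current sentence is visible (partial observability)
        return self.sentences[self.cursor]

    def step(self, action):
        if action == "next":
            self.cursor = min(self.cursor + 1, len(self.sentences) - 1)
        elif action == "previous":
            self.cursor = max(self.cursor - 1, 0)
        # reward r_t: 1.0 once the visible sentence contains the answer
        return float(self.answer in self.observe())

def discounted_return(rewards, gamma=0.9):
    # E[sum_t gamma^t r_t]
    return sum(gamma ** t * r for t, r in enumerate(rewards))

env = ToyEnv(["Paris is a city.", "It is the capital of France."], "capital")
rewards = [env.step("next"), env.step("next")]
print(discounted_return(rewards))  # 1.0 + 0.9 * 1.0 = 1.9
```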
Difficulty Levels Given a question, only the first sentence of a document is initially exposed to the agent. During the information gathering phase, the agent uses the following commands to interact with the environment: 1) previous and 2) next jump to the previous or next sentence, respectively; 3) Ctrl+F QUERY jumps to the sentence with the next occurrence of QUERY; 4) stop terminates the interaction. Whenever the agent issues the stop action, or it has exhausted its interaction budget 2, the information gathering phase is terminated and the agent must answer the question immediately. Thanks to the extractive nature of MRC datasets such as SQuAD, the agent can label a span of its observation o_t as its prediction. Note that in order to correctly answer the question, the agent needs to effectively gather the necessary information so that its observation o_t contains the answer as a sub-string. Yuan et al. (2020) define easy and hard as two difficulty levels. In the easy mode, all four commands are available during the information gathering phase, whereas in the hard mode, only Ctrl+F and stop can be used. Intuitively, in the easy mode, an agent can rely on the next command to traverse the entire document, which essentially reduces the problem to learning to stop at the right sentence. In contrast, in the hard mode, the agent is forced to use Ctrl+F in a smart manner to navigate to potentially informative sentences.

QUERY Types Three finer-grained settings are further defined according to the action space of the Ctrl+F QUERY command. Specifically, ranging from easy to hard, QUERY can be a token extracted from the question (q), a token from the concatenation of the question and the currently observable sentence (q + o_t), or any token selected from the dataset's vocabulary (vocab). From an RL perspective, the sizes of the three settings' action spaces can differ by several orders of magnitude (e.g., a question with 10 tokens in q versus a vocabulary size of 20K in vocab).
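The Ctrl+F command can be sketched as a search over sentences. The wrap-around behavior and the query-absent fallback below are assumptions for illustration; the exact iSQuAD semantics may differ.

```python
def ctrl_f(sentences, query, cursor):
    """Sketch of the Ctrl+F QUERY command: jump to the sentence with
    the next occurrence of QUERY, wrapping around the document; keep
    the cursor unchanged if QUERY is absent (assumed fallback)."""
    n = len(sentences)
    for offset in range(1, n + 1):
        idx = (cursor + offset) % n
        if query in sentences[idx]:
            return idx
    return cursor

doc = ["The sky is blue.", "Grass is green.", "The sea is blue."]
print(ctrl_f(doc, "blue", 0))  # 2: next occurrence after sentence 0
print(ctrl_f(doc, "blue", 2))  # 0: wraps around to sentence 0
```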
Figure 1: Left: an overview of our agent. We propose to use graph representations as an additional input modality to the iMRC agent (Yuan et al., 2020). Right: a zoomed-in view of our encoder module, extended from iMRC.

Methodology
In this work, we adopt the agent proposed in (Yuan et al., 2020) as the baseline, as shown in Figure 1. We propose to add a novel graph updater module and a graph encoding layer to the pipeline. Specifically, at game step t, the graph updater takes the text observation o_t and the graph G_{t−1} from the previous step as input and generates a new graph G_t. Subsequently, the graph is encoded into hidden states, which are later aggregated with the text representations. Note that, distinct from the fully observed Knowledge Graphs (KGs) in static MRC works, our graphs are dynamically generated, i.e., at every interaction step, our agent can update information from the new observation into its graph representations.
In this section, we will first introduce the key methods we use to generate and update the graph representations. Later on, we will describe a Graph Neural Network (GNN)-based graph encoder which encodes the information carried by the graphs. For the common components shared with iMRC, we refer readers to (Yuan et al., 2020) or Appendix A for detailed information.
Notations We denote a graph generated at game step t as G_t = (V_t, E_t), where V_t and E_t represent the sets of vertices (nodes) and edges (relations). All graphs are directed by default. For two nodes i ∈ V and j ∈ V, we denote the connection from i to j as e_{i→j} ∈ E. We represent a graph G as an adjacency tensor of size R × N × N, where R and N denote the number of relations and nodes, respectively. This tensor can be either binary or real-valued depending on the graph type.
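The adjacency-tensor representation can be sketched as follows; the indexing convention adj[r, i, j] for edge e_{i→j} under relation r is an assumption for illustration.

```python
import numpy as np

# Sketch: a graph with N nodes and R relation types stored as an
# R x N x N adjacency tensor, as described in the Notations paragraph.
R, N = 2, 3
adj = np.zeros((R, N, N))

# Under relation 0, add the directed edge e_{0->1}.
adj[0, 0, 1] = 1.0

# Directed by default: e_{0->1} exists but e_{1->0} does not.
print(adj[0, 0, 1], adj[0, 1, 0])  # 1.0 0.0
```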

Generating and Updating Graphs
We propose four different graph representations. The four graph types capture distinct aspects of information in text, from low-level linguistic features to high-level semantics.

Word Co-occurrence
R×N×N, V: words, E: sentences in which the words co-occur.
In word co-occurrence graphs, we connect tokens according to their co-occurrences. We assume that words appearing in the same sentence tend to be relevant. Common words across sentences further enable building more complex graphs in which each word is connected with multiple related concepts.
Omitting the game step subscript t for simplicity, if two tokens i ∈ V and j ∈ V co-occur in a sentence s, the edge e_{i→j} between i and j is defined as s. Computationally (e.g., for GNNs), the representations of i and j are their word embeddings, and the representation of the relation e_{i→j} is the sentence encoding of s. Note that in this setting, graphs are symmetric (i.e., e_{i→j} = e_{j→i}). Typically, tokens i and j can co-occur in multiple sentences. We thus allow multiple connections between two graph nodes, with each connection representing a particular sentence in which they co-occur.
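A minimal sketch of this construction (not the authors' code): each symmetric edge records every sentence index in which the two tokens co-occur, yielding the multi-edge connections described above.

```python
from itertools import combinations

def cooccurrence_graph(sentences):
    """Sketch of a word co-occurrence graph: an edge between two
    tokens lists every sentence index in which they co-occur."""
    edges = {}  # (token_i, token_j) -> list of sentence indices
    for s_idx, sentence in enumerate(sentences):
        tokens = sorted(set(sentence.lower().split()))
        for i, j in combinations(tokens, 2):
            edges.setdefault((i, j), []).append(s_idx)
            edges.setdefault((j, i), []).append(s_idx)  # symmetric
    return edges

g = cooccurrence_graph(["the cat sat", "the cat ran"])
print(g[("cat", "the")])  # co-occur in both sentences: [0, 1]
```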

Relative Position
R×N ×N , V: words, E: relative position between words.
In relative position graphs, we aim to capture and embed word ordering and distance information. This can be seen as capturing a loose form of the Subject-Verb-Object (SVO) structure within sentences, without the need to parse them. The intuition of capturing token position information is supported by the use of position embeddings in training large-scale language models such as BERT (Devlin et al., 2018; Wang et al., 2021).
In our setting, we first define a window size l ∈ Z+. For any two tokens i and j within a sentence s, their relation e_{i→j} is defined as:

e_{i→j} = max(−l, min(l, pos_j − pos_i)), (1)

in which pos_i and pos_j indicate the two tokens' position indices in s. Therefore, the total number of relations is R = 2l + 1; the set of relations consists of all integers from −l to l. We also connect tokens with themselves via self-connections (e_{i→i} = 0) to facilitate message passing in GNNs.

Figure 2: An example sentence: "In addition to pharma responsibilities, the pharma offered general medical advice and a range of services that are now performed solely by other specialist practitioners, such as surgery and midwifery."
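The clipped relative position scheme above can be sketched as follows; the matrix-of-labels layout (rather than an R×N×N one-hot tensor) is a simplification for readability.

```python
import numpy as np

def relative_position_graph(tokens, l=2):
    """Sketch of the relative position graph: the relation between two
    tokens is their position offset, clipped to a window of size l, so
    there are R = 2l + 1 relations (integers from -l to l)."""
    n = len(tokens)
    # rel[i, j] holds the relation label for edge i -> j
    rel = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            rel[i, j] = max(-l, min(l, j - i))  # self-connections: e_{i->i} = 0
    return rel

rel = relative_position_graph(["in", "addition", "to", "pharma"], l=2)
print(rel[0, 1], rel[0, 3], rel[3, 0])  # 1 2 -2 (offsets clipped at +/-l)
```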

Semantic Role Labeling (Parser-based)
R×N ×N , V: chunks returned by SRL, E: semantic role labels of the chunks.
Similar to recent approaches designed for static MRC tasks (Zheng and Kordjamshidi, 2020; Zhang et al., 2020), we investigate building knowledge graphs via Semantic Role Labeling (SRL). An SRL system (Shi and Lin, 2019) can detect the arguments associated with each of the predicates (or verbs) within a sentence and classify them into specific roles. This property is essential for tackling MRC tasks, especially extractive QA datasets, where answers are typically short chunks of text (e.g., entities); such chunks can often be identified as arguments by an SRL parser. Via SRL, we can easily see how arguments interact with each other; further connecting common chunks detected across multiple sentences produces an argument flow that helps in understanding the paragraph.
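Converting parser output into graph edges can be sketched as follows. The frame format below imitates what an SRL parser might return; both the format and the example arguments are illustrative assumptions, not the actual parser API.

```python
def srl_edges(frames):
    """Sketch: turn hypothetical SRL parser output into directed,
    labeled graph edges e_{predicate -> argument}."""
    edges = []
    for frame in frames:
        verb = frame["verb"]
        for label, chunk in frame["arguments"]:
            edges.append((verb, chunk, label))
    return edges

# Hypothetical parse of the Figure 2 example sentence.
frames = [{
    "verb": "performed",
    "arguments": [("ARG1", "a range of services"),
                  ("ARG0", "other specialist practitioners")],
}]
print(srl_edges(frames)[0])  # ('performed', 'a range of services', 'ARG1')
```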
In our SRL graphs 3, we use chunks that are identified as predicates, arguments, and modifiers as nodes. We use their semantic role labels w.r.t. their corresponding predicates as relations. For longer sentences, an SRL system often returns multiple predicate-centric graphs. We define a special ROOT node for each sentence and connect it with all the predicates within the sentence via a special relation ROOT-VERB. We connect sentences by connecting their root nodes using a relation ROOT-ROOT. An example of an SRL graph is shown in Figure 2. To facilitate models such as GNNs (e.g., easier message passing, denser signals), we define a set of reversed relations in SRL graphs. For instance, in Figure 2, e_{performed→services} is ARG1; we use an ARG1-rev relation as e_{services→performed} in our experiments. However, we omit the reversed relations in Figure 2 for clarity.

Continuous Belief (Data-driven)
In addition to rule- and parser-based graph updaters, we also investigate a data-driven approach. Inspired by Adhikari et al. (2020), we use a self-supervised learning technique to pre-train a graph updater that maintains a continuous belief graph for the iMRC task. Specifically, given the new observation o_t, we use a neural network to modify the graph G_{t−1} from the previous step to produce a new graph G_t. Without the need to manually define and hand-craft specific graph structures (which may inject unnecessary priors into the agent), we assume that as long as a learned graph G_t can be used to reconstruct o_t, the graph contains useful information about the text.
However, learning to reconstruct o_t in a word-by-word fashion requires the model to learn features that are less useful for the iMRC task. Therefore, we adopt a contrastive representation learning strategy to approximate the reconstruction. Specifically, we train a discriminator D that differentiates between true observations o_t (positive samples) and "corrupted" versions of them, õ_t (negative samples), conditioned on G_t. This relieves the model of the burden of learning syntactic features (as in NLG), so that it can focus on the semantics instead. We use a standard noise-contrastive objective to minimize the binary cross-entropy (BCE) loss (Veličković et al., 2019):

L = −(1/K) ∑_{k=1}^{K} [log D(G_t, o_t^k) + log(1 − D(G_t, õ_t^k))],

where K is the number of sentences in a SQuAD paragraph. To facilitate this pre-training, we utilize an online Wikipedia dump (Wilson, 2013). We remove all articles that appear in the SQuAD dataset and use the rest as our negative sample collection 4.
The graph updater is then trained to generate graphs G_t that can be used to differentiate between 1) sentences within the current document and 2) sentences sampled from another Wikipedia article. Note that G_t is not explicitly grounded in any ground-truth graph. Instead, these graphs are essentially latent recurrent state representations, encoding the information the agent has seen so far. Therefore, the nodes and relations in these graphs are determined by the agent itself in a data-driven manner. As a result, the adjacency tensors of these graphs are real-valued. We provide more details of this graph updater in Appendix A.4.
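The noise-contrastive objective can be sketched numerically as below. The bilinear discriminator, the vector shapes, and the random features are all illustrative assumptions; the point is only that BCE pushes true-pair scores toward 1 and corrupted-pair scores toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(g, s, W):
    # D(G_t, o_t): a bilinear score in (0, 1) (assumed form)
    return sigmoid(g @ W @ s)

def nce_bce_loss(g, pos, neg, W):
    # -(1/K) * sum_k [log D(G, o_k) + log(1 - D(G, o~_k))]
    loss = 0.0
    for p, n in zip(pos, neg):
        loss -= np.log(discriminator(g, p, W))        # true sentence -> 1
        loss -= np.log(1.0 - discriminator(g, n, W))  # corrupted -> 0
    return loss / len(pos)

g = rng.normal(size=8)                        # graph summary vector
W = rng.normal(size=(8, 8)) * 0.1             # discriminator weights
pos = [rng.normal(size=8) for _ in range(3)]  # true sentence encodings
neg = [rng.normal(size=8) for _ in range(3)]  # corrupted encodings
print(nce_bce_loss(g, pos, neg, W) > 0.0)     # BCE loss is non-negative
```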

Encoding Graph Representations
We adopt a multi-layer relational graph convolutional network (R-GCN) (Schlichtkrull et al., 2018; Adhikari et al., 2020) as our graph encoder. Specifically, at the l-th layer of the R-GCN, for each node i ∈ V, given the set of its neighbor nodes V_i^e ⊆ V under relation e ∈ E, the R-GCN computes:

h_i^{l+1} = σ( ∑_{e∈E} ∑_{j∈V_i^e} (1/|V_i^e|) W_e^l h_j^l + W_0^l h_i^l ), (2)

where W_e^l and W_0^l are trainable parameters. When the graph is discrete (i.e., word co-occurrence, relative position, SRL), we use ReLU as the activation function σ; when the graph is continuous (i.e., continuous belief), we use the Tanh function as σ to stabilize the model.
When the labels of graph nodes consist of tokens, we integrate their word representations into the graph computation. Specifically, for a node i, we use the concatenation of a randomly initialized node embedding vector and the averaged word embeddings of the node's label as the initial input h_i^0. Similarly, for each relation e, Emb_e is the concatenation of a randomly initialized relation embedding vector and the averaged word embeddings of e's label.
We utilize highway connections (Srivastava et al., 2015) between R-GCN layers:

g_i^l = sigmoid(W_hw h_i^l),
h_i^{l+1} ← g_i^l ⊙ h_i^{l+1} + (1 − g_i^l) ⊙ h_i^l,

where ⊙ indicates element-wise multiplication and W_hw is a linear layer. We denote the final output of the R-GCN as h_t^G ∈ R^{N×H}, where N is the number of nodes in the graph and H is a hyperparameter.
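One R-GCN layer with a highway connection can be sketched in numpy as below. The mean-over-neighbors normalization, weight shapes, and initialization are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rgcn_highway_layer(h, adj, W_rel, W_self, W_hw):
    """Sketch of one R-GCN layer with a highway connection.
    h: N x H node states; adj: R x N x N adjacency tensor."""
    out = h @ W_self  # self-connection term W_0 h_i
    for e in range(adj.shape[0]):
        deg = adj[e].sum(axis=1, keepdims=True)          # |V_i^e| per node
        norm = np.where(deg > 0, 1.0 / np.maximum(deg, 1), 0.0)
        out += norm * (adj[e] @ (h @ W_rel[e]))          # sum over neighbors j
    out = np.maximum(out, 0.0)       # ReLU (discrete-graph case)
    gate = sigmoid(h @ W_hw)         # highway gate from the layer input
    return gate * out + (1.0 - gate) * h

rng = np.random.default_rng(0)
N, H, R = 4, 8, 2
h = rng.normal(size=(N, H))
adj = rng.integers(0, 2, size=(R, N, N)).astype(float)
W_rel = rng.normal(size=(R, H, H)) * 0.1
W_self = rng.normal(size=(H, H)) * 0.1
W_hw = rng.normal(size=(H, H)) * 0.1
print(rgcn_highway_layer(h, adj, W_rel, W_self, W_hw).shape)  # (4, 8)
```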

Aggregating Multiple Modalities
Following Yuan et al. (2020), we utilize the context-query attention mechanism (Yu et al., 2018) to aggregate multiple representations. The inputs to a context-query attention layer are typically two sequences of representation vectors (e.g., a sequence of tokens for text, a sequence of nodes for graphs). The attention layer computes element-wise similarity scores between the two inputs; each element in one input can then be represented by the weighted sum of the other input, and vice versa. As shown in Figure 1 (right), we stack another context-query attention layer on top of the encoder used in iMRC, to aggregate the text representation (which encodes information in o_t and q) with the graph representation. We denote the aggregated output of this second attention layer as h_inp.

Although all four graph types we investigate are updated dynamically, they can only represent the agent's belief about the current state s_t. Some information is clearly hard to represent in the graphs, such as how an agent navigated to the current sentence (i.e., the trajectory). We thus leverage a recurrent neural network to incorporate history information into the encoder's output representations. Specifically, we use a GRU (Cho et al., 2014) as the recurrent component:

M_t = GRU(h_inp, M_{t−1}),

in which h_inp ∈ R^H and M_{t−1} is the output of the GRU cell at game step t − 1. As shown in Figure 1 (left), the output of the encoder, M_t, is used both to generate actions during the information gathering phase and to extract answers during the question answering phase; this procedure exactly follows the iMRC pipeline.
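The bidirectional attention between the two modalities can be sketched as follows. Plain dot-product similarity is a simplifying assumption (the original context-query attention uses a trainable trilinear similarity function).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_query_attention(text, graph):
    """Sketch of context-query attention between two sequences.
    text: T x H token states; graph: N x H node states."""
    sim = text @ graph.T                        # T x N similarity scores
    text2graph = softmax(sim, axis=1) @ graph   # each token attends to nodes
    graph2text = softmax(sim.T, axis=1) @ text  # each node attends to tokens
    return text2graph, graph2text

rng = np.random.default_rng(0)
t2g, g2t = context_query_attention(rng.normal(size=(5, 8)),
                                   rng.normal(size=(3, 8)))
print(t2g.shape, g2t.shape)  # (5, 8) (3, 8)
```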

Experiments and Results
We conduct experiments on the iSQuAD dataset (Yuan et al., 2020) to answer three key questions:
• Q1: Do graph representations help agents achieve better performance? In particular, among the four graph types, which provides the largest performance boost?
• Q2: Do graph representations remain helpful when the environment provides explicit memory mechanisms?
• Q3: If graph representations are great, can we get rid of the text modality?

Table 1: Testing F_1 scores and the relative improvement %RI (averaged over six settings in a row). Best single agent scores within each setting are highlighted in boldface; scores better than iMRC are shaded in yellow.
Experiment Setup: The iSQuAD dataset (Yuan et al., 2020) is an interactive version of the SQuAD dataset (Rajpurkar et al., 2016), consisting of 82k/5k/10k environments for training, validation, and testing. As described in Section 2, iSQuAD contains two difficulty levels and three finer-grained QUERY type settings, all of which influence an RL agent's action space. Depending on the configuration, the environment provides an observation queue with k memory slots, where k ∈ {1, 3, 5}. The observation queue stores the k most recent observation sentences to alleviate difficulties caused by partial observability. Note that in the configuration where k = 1, no history information is stored.
Inherited from the original SQuAD dataset, an agent is evaluated by the F_1 score between its predicted answer and the ground-truth answers. We compare our agents equipped with graph representations against the iMRC scores reported in (Yuan et al., 2020). Specifically, we also report an agent m's relative improvement over iMRC:

%RI = (F1_m − F1_iMRC) / F1_iMRC × 100%.

For all experiment settings, we train the agent with three different random seeds. We compute an agent's test score using the model checkpoint that achieves the best validation score.

A1: Graph representations indeed help, and ensembling is a useful strategy.
Intuitively, a dynamically maintained graph can serve as an agent's episodic memory. Therefore, the less information is provided by the environment, the more useful the graphs can be. We first investigate our graph-aided agent's performance in the game configuration where only a single memory slot is available. This is arguably the most difficult configuration in iMRC, where any valid action can lead to a completely different observation (a new sentence). As a result, the agent needs to rely on its own memory mechanism. As shown in Table 1 (#Mem Slot = 1), our agent outperforms iMRC in most of the settings by a noticeable margin. We observe that the improvement brought by graph representations is consistent across the four graph types, all of which provide over 9% average relative improvement over iMRC. Among the four graph types, the relative position graph and the SRL graph seem to show an advantage over the other two, but this trend is not significant.
Following the standard model ensemble strategy in MRC works, we test an ensemble of the four graphs. Specifically, taking the four individual agents, each trained with its corresponding graph type, we mix their decisions at test time. During the information gathering phase, we sum up the four agents' output probabilities, including the probabilities over action words (i.e., previous, next, ctrl+f, stop) and the probabilities over the QUERY tokens. The four agents then take the action with the maximum summed probability to keep interacting with the environment. During the question answering phase, we likewise sum up the output probabilities (over tokens in the sentence where the agents stop) and generate answers accordingly. Surprisingly, we find that the ensemble greatly boosts the agent's performance. As shown in Table 1 (#Mem Slot = 1), the ensemble agent nearly doubles our single agents' relative improvement over iMRC. It is also worth noting that with ensembling, our agent achieves a better score than iMRC in the Hard Mode + vocab setting, in which all the individual agents fail to outperform the baseline. This observation aligns with our motivation that the four types of graphs capture different aspects of the information and thus may be complementary to each other.
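The decision-mixing step of the ensemble can be sketched as below; the probability values are illustrative, not measured outputs.

```python
import numpy as np

# Sketch of the ensemble strategy: sum the four agents' output
# probabilities over the action words and take the argmax.
actions = ["previous", "next", "ctrl+f", "stop"]
agent_probs = np.array([
    [0.1, 0.6, 0.2, 0.1],   # agent with word co-occurrence graph
    [0.1, 0.2, 0.6, 0.1],   # agent with relative position graph
    [0.1, 0.1, 0.7, 0.1],   # agent with SRL graph
    [0.2, 0.3, 0.4, 0.1],   # agent with continuous belief graph
])
summed = agent_probs.sum(axis=0)
print(actions[int(summed.argmax())])  # ctrl+f (summed probability 1.9)
```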
A2: Graph representations remain helpful even with explicit memories.
As mentioned above, in some configurations, the iSQuAD environment provides an observation queue that caches the most recent few observations as an explicit memory mechanism. A natural question is whether the advantages of graph representations diminish as the partial observability of the environment decreases. We train and test our agent using iSQuAD's configurations where 3 or 5 memory slots are available (i.e., at game step t, the input o_t to the agent is the concatenation of the most recent 3 or 5 sentences it has seen). From Table 1, we observe that the previously observed trends are consistent across memory slot configurations. In particular, in the settings with 3 or 5 memory slots, our single agents equipped with graph representations outperform iMRC in most of the settings. All graph types provide a greater than 8% averaged relative improvement. Again, the ensemble agent nearly doubles the single agents' relative improvement over iMRC. In the setting with 5 memory slots, we observe that the continuous belief graph consistently outperforms its counterparts, providing a 12.19% relative improvement.
Given the observation that graph representations remain helpful even with explicit memories, we further compare graphs as a memory mechanism against the explicit memory slots provided by iSQuAD environments. Comparing our best graph-aided agent (receiving a single sentence as input) against iMRC (receiving 3 and 5 sentences as input), we find that our agent achieves relative improvements of 13.03% and 13.47% over iMRC, respectively. This suggests that the design of the memory mechanism plays a big role in interactive reading comprehension tasks. Although the concatenation of memory slots may provide a comparable amount of information, the inductive bias of graph representations is stronger.
A3: Graph representations cannot replace the text modality.
Based on our findings in the previous subsections, we further investigate whether the dynamic graphs can replace the text modality. We conduct a set of ablation experiments on the Easy Mode + q games, with a single memory slot. Specifically, at every game step t, given the new observation o_t, we use the graph updater (described in Section 3.1) to generate graph representations G_t. After encoding by the graph encoder (described in Section 3.2), we directly aggregate the graph encoding with the question representations for further computation. In this way, the observation sentence o_t is only used to build the graph, without serving as a direct input modality to the agent. 5 After training and testing such graph-only variants of our agent, we compare them against the text-only version (Yuan et al., 2020) and our full agent with both input modalities in Table 2. We observe that the graph-only agent fails to outperform the text-only baseline with any of the graph types, even with their ensemble. This suggests that even though the text and graph modalities may contain redundant information (because the graphs are generated from the text), they represent the information in complementary ways. We suspect the attention mechanism integrating text representations and graph representations (described in Section 3.3) may have contributed to the improvement of the full agent. For instance, the agent may have learned to focus on certain subgraphs conditioned on tokens in o_t, and vice versa.

Additional Results and Discussion
As described in Section 3.3, our agent utilizes a recurrent component to be aware of history information; this component is absent in the original iMRC architecture. Therefore, it is important to make sure the performance improvements shown in the previous subsections are not solely caused by the RNN. We conduct a set of experiments with the RNN layer disabled (i.e., the output of the attention layer becomes M_t in Figure 1, right). We show results of these experiments in Table 3; due to space limitations, we only show the averaged relative improvement over iMRC, and readers can find full results in Appendix C. Overall, graph representations contribute more to the improvement (for single agents, more than 5% on average); this is especially clear for the ensemble agents, which even without the RNN can sometimes achieve performance very close to that of the full agent. However, the effect of the RNN is non-negligible. This again emphasizes the importance of memory mechanisms in interactive reading comprehension tasks. From our findings, the multiple distinct memory mechanisms (i.e., memory slots, graphs, RNN cells) do not seem redundant; rather, they work cooperatively to produce a better score than using any one of them alone.
It is noticeable in Table 1 that all agents perform poorly on Hard Mode + vocab games. This reveals a limitation of RL-based algorithms (such as the deep Q-learning we use in this work): when the action space is extremely large, the agent has a near-zero probability of experiencing a trajectory that leads to any positive reward, and thus struggles to learn useful strategies. This can potentially be mitigated by pre-training the agent in an easier setting and then fine-tuning it in the difficult setting, so that the agent has a higher probability of experiencing good trajectories to start with.
A recent work (Guo et al., 2021) proposes to facilitate RL in tasks with huge action spaces (e.g., natural language generation) using Path Consistency Learning (PCL). Their PCL-based training method can update the Q-values of all actions (tokens in the vocabulary) at once, as opposed to updating only the selected action (a single token) as in vanilla Q-learning. This could potentially enable iMRC agents to behave in a more natural and generic manner, for instance, to Ctrl+F multi-word expressions as QUERY.
Due to space limitation, we report detailed agent structure, more results, and implementation details in Appendices.

Related Work
MRC has become an ever-growing area in the past decade, especially since the success of deep neural models. In an adversarial-game-like fashion, researchers keep releasing new datasets (Hill et al., 2015; Chen et al., 2016; Rajpurkar et al., 2016; Trischler et al., 2016a; Nguyen et al., 2016; Reddy et al., 2018; Yang et al., 2018a; Choi et al., 2018; Clark et al., 2020) that challenge increasingly capable models. While some researchers believe models have achieved human-level performance, others argue that biases or trivial cues have been injected into MRC datasets unconsciously (Agrawal et al., 2016; Weissenborn et al., 2017; Mudrakarta et al., 2018; Sugawara et al., 2018; Niven and Kao, 2019; Sen and Saffari, 2020). These biases may cause models to learn shallow pattern matching rather than deep understanding and reasoning skills.
iMRC (Yuan et al., 2020) is a line of research that assumes partial observability and insufficient information. To answer a question, models have to actively collect the necessary information by interacting with the environment. The iMRC paradigm can be described naturally within the RL framework, and thus it shares interests with research on video games and text-based games.

Broader Impact
Our work is a proof-of-concept study: we use a relatively simple and restricted (in terms of both observations and actions) QA dataset, iSQuAD, for both training and evaluation. Although the current version of our work might have limited consequences for society, we believe that taking a broader view of our work can be beneficial by preventing our future research from causing potential social and ethical concerns.
Similar to many RL-based systems, the information gathering module of our agent is optimized solely on its performance w.r.t. the final metric, without many constraints on its behavior at each game step. This can potentially make the system vulnerable, since the RL agent may develop undesirable strategies that optimize the final metric.
In our current setting, the action space of the information gathering module is restricted (see Section 2). However, if we consider a more general setting, e.g., equipping the agent with a larger action space by allowing it to generate a sequence of tokens as the QUERY for the Ctrl+F action, we have to be extra careful about the aforementioned side effects of RL training. For instance, the agent may develop unfavorable behaviors, such as forgetting proper syntax or abusing certain pronouns, to optimize its final rewards.

Conclusion
We explore leveraging graph representations in the challenging iMRC tasks. We investigate different categories of graph structures that capture text information at various levels, and we describe methods that dynamically generate the graphs during information gathering. Experimental results show that graph representations provide consistent improvements across settings. This supports our hypothesis that graph representations are proper inductive biases for iMRC.

Contents in Appendices:
• In Appendix A, we provide detailed information about our agent architecture.
• In Appendix B, we provide implementation details.
• In Appendix C, we provide the full set of our experiment results.

A Details on Agent Structure
In this section, we provide detailed information about our agent. We describe each of the modules shown in Figure 1. Some information here may be redundant with Section 3; we repeat it here for the reader's convenience.

Notations
We use game step t to denote one round of interaction between an agent and the iSQuAD environment. We use o_t to denote the text observation at game step t, and q to denote the question text. We use L to refer to a linear transformation; a superscript on L denotes the activation function applied to the linear layer. Brackets [⋅; ⋅] denote vector concatenation.

A.1 Encoder
At game step t, the encoder takes the text observation (a sentence) o_t, the question q, and the graph G_t generated by the graph updater (if applicable) as input. It first converts each input into vector representations, then aggregates them using an attention mechanism.

A.1.1 Text Encoder
We use a transformer-based encoder, which consists of an embedding layer and a transformer block (Vaswani et al., 2017). Specifically, embeddings are initialized by vectors extracted from a BERT model (Devlin et al., 2018) that is pre-trained on a large corpus and fine-tuned on SQuAD 6 . The embedding size is 1024; embeddings are fixed during training in all settings. The transformer block consists of a stack of 4 convolutional layers, a self-attention layer, and a 2-layer MLP with a ReLU non-linear activation function in between. Within the block, each convolutional layer has 96 filters with a kernel size of 7. In the self-attention layer, we use a block hidden size H of 96, as well as a single-head attention mechanism. Layer normalization (Ba et al., 2016) is applied after each component inside the block. Following standard transformer training, we add positional embeddings to each block's input.
At every game step t, we use the same text encoder to process o_t and q. The resulting representations are h_ot ∈ R^{L_ot×H} and h_q ∈ R^{L_q×H}, where L_ot is the number of tokens in o_t, L_q denotes the number of tokens in q, and H = 96 is the hidden size.

A.1.2 Graph Encoder
We adopt the graph encoder from Adhikari et al. (2020), which is a model based on R-GCN (Schlichtkrull et al., 2018). Specifically, at the l-th layer of the R-GCN, for each node i ∈ V, given the set of its neighbor nodes V_i^e ⊆ V under relation e ∈ E, the R-GCN computes:

h_i^{l+1} = σ( W_0^l h_i^l + Σ_{e∈E} Σ_{j∈V_i^e} (1/|V_i^e|) W_e^l h_j^l ),

where W_e^l and W_0^l are trainable parameters. When the graph is discrete (i.e., word co-occurrence, relative position, SRL), we use ReLU as the activation function σ; when the graph is continuous (i.e., continuous belief), we use the Tanh function as σ to stabilize the model.
As the initial input h^0 to the graph encoder, we concatenate a node embedding vector and the averaged word embeddings of the node text (e.g., a word in the word co-occurrence graph, a chunk of a sentence in the SRL graph). Similarly, for each relation e, Emb_e is the concatenation of a relation embedding vector and the averaged word embeddings of e's label. Both the node embedding and relation embedding vectors are randomly initialized and trainable. We utilize highway connections (Srivastava et al., 2015) between layers, where a sigmoid gate g interpolates between a layer's input and output: h^{l+1} ← g ⊙ h^{l+1} + (1 − g) ⊙ h^l, where ⊙ indicates element-wise multiplication.
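To make the message-passing concrete, below is a minimal NumPy sketch of one R-GCN layer with uniform 1/|V_i^e| neighbor averaging and a highway connection. The function and variable names are illustrative, and the gate parameterization is one common choice; this is a sketch of the technique, not the paper's exact implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rgcn_highway_layer(h, adj, W_rel, W_self, W_gate, b_gate):
    """One R-GCN layer with a highway connection.
    h:      (N, H)    node features
    adj:    (E, N, N) one adjacency matrix per relation
                      (adj[e, i, j] = 1 if j neighbors i under relation e)
    W_rel:  (E, H, H) per-relation weights; W_self: (H, H) self-loop weights
    W_gate, b_gate:   highway-gate parameters (illustrative choice)
    """
    E = adj.shape[0]
    msg = h @ W_self                                 # self-loop term W_0 h_i
    for e in range(E):
        deg = adj[e].sum(axis=1, keepdims=True)      # |V_i^e| per node
        norm = adj[e] / np.maximum(deg, 1.0)         # 1/|V_i^e| averaging
        msg = msg + norm @ (h @ W_rel[e])            # relation-e messages
    h_new = relu(msg)                                # sigma = ReLU (discrete graphs)
    g = sigmoid(h @ W_gate + b_gate)                 # highway gate
    return g * h_new + (1.0 - g) * h                 # gated skip connection

rng = np.random.default_rng(0)
N, H, E = 5, 8, 3
h = rng.standard_normal((N, H))
adj = (rng.random((E, N, N)) < 0.3).astype(float)
out = rgcn_highway_layer(h, adj,
                         rng.standard_normal((E, H, H)) * 0.1,
                         rng.standard_normal((H, H)) * 0.1,
                         rng.standard_normal((H, H)) * 0.1,
                         np.zeros(H))
```

Stacking three such layers with H = 96, as described above, yields the full graph encoder.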
We use a 3-layer graph encoder with a hidden size H = 96 in each layer. The node embedding size and relation embedding size are 100 and 32, respectively. The number of bases we use is 3. The final output of the graph encoder lies in R^{N×H}, where N is the number of nodes in the graph.

A.1.3 Attention

Context-Query Attention To aggregate the question q with context coming from various modalities (i.e., text and graph), we adopt the context-query attention layer from Yu et al. (2018). We use a unified notation c to represent the context in the description of the context-query attention; the encoding of c is denoted as h_c ∈ R^{L_c×H}.
The attention layer first uses two MLPs to convert both h_c and h_q into the same space; the resulting tensors are denoted h_c' and h_q'. A trilinear similarity function then scores each context-question token pair: Sim(i, j) = W[c_i'; q_j'; c_i' ⊙ q_j'], where W is a trainable parameter with hidden size 96. Softmaxes of the resulting similarity matrix S along its two dimensions produce S_A and S_B. Information in the two representations is then aggregated as [h_c; A; h_c ⊙ A; h_c ⊙ B], where A = S_A h_q' and B = S_A S_B^⊤ h_c'. Next, a linear transformation projects the aggregated representations to a space of size H = 96, yielding the aggregated context-query representation h_cq ∈ R^{L_c×H}.
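The context-query attention step can be sketched as follows in NumPy. For simplicity, the trainable W is flattened into a single weight vector applied to each concatenated pair, and the final H-sized projection is omitted; names are illustrative, not the paper's code.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_query_attention(h_c, h_q, w):
    """Context-query attention in the style of Yu et al. (2018).
    h_c: (Lc, H) context encoding, h_q: (Lq, H) question encoding
    w:   (3H,)   trilinear similarity weights (a simplified stand-in
                 for the trainable W in the text)
    Returns the aggregated representation of shape (Lc, 4H)."""
    Lc, Lq = h_c.shape[0], h_q.shape[0]
    # trilinear similarity: S[i, j] = w . [c_i; q_j; c_i * q_j]
    S = np.empty((Lc, Lq))
    for i in range(Lc):
        for j in range(Lq):
            S[i, j] = w @ np.concatenate([h_c[i], h_q[j], h_c[i] * h_q[j]])
    S_A = softmax(S, axis=1)          # normalize over question tokens
    S_B = softmax(S, axis=0)          # normalize over context tokens
    A = S_A @ h_q                     # context-to-query aggregation
    B = S_A @ S_B.T @ h_c             # query-to-context aggregation
    return np.concatenate([h_c, A, h_c * A, h_c * B], axis=1)

rng = np.random.default_rng(0)
h_c = rng.standard_normal((6, 4))     # toy context, Lc=6, H=4
h_q = rng.standard_normal((3, 4))     # toy question, Lq=3
out = context_query_attention(h_c, h_q, rng.standard_normal(12))
```

In the full model, a linear layer then maps the (Lc, 4H) output back to (Lc, H) with H = 96.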
Context-Context Attention Given the aggregated text-query representations h_oq and the aggregated graph-query representations h_gq, the context-context attention aims to merge them together and generate an overall representation that encodes all available information the agent has seen so far. Specifically, the context-context attention is implemented as a stack of transformer blocks. The structure of these blocks is similar to the text encoder blocks, except that we append an extra attention layer after the self-attention mechanism. This extra attention layer computes the attention between the text-query representations and the graph-query representations, followed by an extra layer normalization. In each block, we use a stack of 2 convolutional layers; each convolutional layer has 94 filters with a kernel size of 5. It is worth noting that this additional attention layer is performed only when graph representations are enabled, and is skipped otherwise. We stack 7 such transformer layers; they output h_og ∈ R^{L_ot×H}, where L_ot is the length of o_t and H = 96 is the hidden size.

A.1.4 Recurrent Component
As mentioned in Section 4, we have a setting where the encoder is recurrent, so that the agent can incorporate history information into its representations.
In that specific setting, we use a GRU (Cho et al., 2014) as the recurrent component: h_t = GRU(h_inp, h_{t−1}), in which h_inp ∈ R^H is obtained by mean pooling the encoder output along the dimension of the number of tokens, and h_{t−1} is the output of the GRU cell at game step t − 1.

A.2 Action Generator
Let M ∈ R^H denote the output of the attention layers described above. The action generator takes M as input and generates rankings for all possible actions. As defined in Yuan et al. (2020), a Ctrl+F command is composed of two tokens (the token "Ctrl+F" and the QUERY token). Therefore, the action generator consists of three multi-layer perceptrons (MLPs): a shared layer L_shared, followed by L_action and L_query, which produce Q_action and Q_query, the Q-values of the action token and of the QUERY token (when the action token is "Ctrl+F"), respectively. The hidden size of L_shared is 150. The output size of L_action is either 2 (hard mode, where only the Ctrl+F and stop commands are allowed) or 4 (easy mode, where previous and next are also allowed), depending on the game mode. Following Press and Wolf (2017), we tie the input and output embeddings. Specifically, the linear layer L_query followed by a Tanh activation projects h_shared into the same space as the embeddings (with dimensionality 1024); the pre-trained BERT embedding matrix then generates the output logits Q_query (Q-values), whose size equals the vocabulary size.
Under different settings where the selection space of the QUERY is restricted, we apply different masks to Q_query. For instance, in the setting where the QUERY is a word selected from q + o_t, we use a mask of the same size as the vocabulary, in which only tokens that appear in either q or o_t are set to 1.
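The masking step can be sketched as follows; setting disallowed entries to negative infinity guarantees they are never selected by a greedy argmax. The function name, toy vocabulary, and Q-values are illustrative, not taken from the paper.

```python
import numpy as np

def mask_query_qvalues(q_query, vocab, allowed_tokens):
    """Restrict QUERY Q-values to tokens appearing in q + o_t.
    q_query: (V,) raw Q-values over the vocabulary.
    Masked-out entries become -inf so argmax never selects them.
    (Names and values here are illustrative.)"""
    mask = np.array([1.0 if tok in allowed_tokens else 0.0 for tok in vocab])
    return np.where(mask > 0, q_query, -np.inf)

# toy example: only tokens seen in the question/observation are selectable
vocab = ["the", "capital", "of", "france", "paris", "banana"]
q_query = np.array([0.1, 0.9, 0.2, 1.5, 0.7, 2.0])
allowed = set("what is the capital of france".split())
masked = mask_query_qvalues(q_query, vocab, allowed)
best = vocab[int(np.argmax(masked))]
```

Here "banana" has the highest raw Q-value, but because it appears in neither q nor o_t it is masked out, and the highest-scoring allowed token ("france") is selected instead.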

A.3 Question Answerer
Whenever the action generator generates the command stop, or the agent has used up its budget of moves, the information gathering phase terminates. At this point, the agent must use its current internal representations to answer the question.
The question answerer is a simple MLP-based layer. It takes h_og ∈ R^{L_ot×H} as input and generates a head distribution and a tail distribution over the tokens in o_t.
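Given the head and tail distributions, the extractive answer is the span maximizing the product of the head and tail probabilities with head ≤ tail. A NumPy sketch, with single weight vectors standing in for the MLP heads (names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def extract_span(h_og, w_head, w_tail):
    """Pick the most likely answer span from o_t.
    h_og:   (L, H) final token representations
    w_head, w_tail: (H,) illustrative stand-ins for the MLP heads
    Returns (head_idx, tail_idx) with head_idx <= tail_idx."""
    p_head = softmax(h_og @ w_head)       # head distribution over tokens
    p_tail = softmax(h_og @ w_tail)       # tail distribution over tokens
    L = len(p_head)
    best, best_p = (0, 0), -1.0
    for i in range(L):                    # brute-force search over valid spans
        for j in range(i, L):
            p = p_head[i] * p_tail[j]
            if p > best_p:
                best, best_p = (i, j), p
    return best

rng = np.random.default_rng(2)
span = extract_span(rng.standard_normal((10, 96)),
                    rng.standard_normal(96), rng.standard_normal(96))
```

The predicted answer is then the sub-string of o_t between the head and tail indices, which is why a correct answer requires the final observation to contain the ground-truth answer (see Appendix C).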

A.4 Graph Updater: Continuous Belief
Among the four proposed graph categories, only the continuous belief graph is generated by the agent, and its graph updater is trained with a data-driven approach. We therefore describe the structure of this graph updater and the way we train it. Because a large portion of this module is adopted from Adhikari et al. (2020), we provide a high-level description of the method and refer readers to Adhikari et al. (2020) for details.
We show the continuous belief graph updater's training pipeline in Figure 3; it consists of two parts: the graph updater itself (red block on the left) and a decoder that helps to train the graph updater (green block on the right). As mentioned in Section 3.1.4, the idea is to train a graph updater that can modify and maintain a graph G_t using the text observation o_t and the graph from the previous game step, G_{t−1}. The graph G_t should contain sufficient information so that, conditioned solely on G_t, a discriminator can differentiate the true observation o_t from a negative sample õ_t.
In Figure 3, the text encoder and graph encoder are similar to the modules described in Appendix A.1. The f_∆ function is a layer with an attention mechanism inside; it aggregates the text representations and graph representations and outputs a vector ∆g_t, which represents the new information in observation o_t compared to the graph at the previous game step, G_{t−1}.
The ⊕ function is a graph operation function that produces the new belief representation h_t given h_{t−1} and ∆g_t: h_t = ⊕(h_{t−1}, ∆g_t). The graph operation function is implemented with a GRU (Cho et al., 2014). The function f_d is an MLP that decodes the recurrent state h_t into a real-valued adjacency tensor (i.e., the continuous belief graph G_t).
On the decoder side (green block on the right), the graph representations of G_t are concatenated with the text representations of both o_t and õ_t; the resulting vectors are fed into an MLP-based discriminator, which is trained with the standard binary cross-entropy (BCE) loss.
After pre-training, the graph updater (red block on the left) is fixed and plugged into our RL agent to produce continuous belief graphs.

B Implementation Details
In this section, we provide hyperparameters and other implementation details.
For our full agent, we adopt the training procedure of DRQN (Hausknecht and Stone, 2015; Yuan et al., 2018). For the agent variants where the recurrent component is absent, we use the training procedure of DQN (Mnih et al., 2013).
For all experiments, we use Adam (Kingma and Ba, 2015) as the optimizer. The learning rate is set to 0.00025, with gradients clipped to a norm of 5. We train all agents with 3 different random seeds, choose the seed that produces the best validation performance, and report its scores on the test set. To be comparable with iMRC (Yuan et al., 2020), we train all agents for 1 million episodes, each with a maximum of 20 steps.
The running speed of an agent depends on its specific configuration, e.g., the type of graph it is equipped with. On average, reaching the best validation score takes an agent about 3 days on a single Nvidia P100 GPU.

C Full Results
In Tables 4, 5, and 6, we provide the full results of our experiments. Although the only metric used to evaluate an agent's performance on the iMRC task is the F_1 score between the prediction and the ground-truth answers, the authors of (Yuan et al., 2020) also monitor agents' sufficient information rewards. Specifically, sufficient information rewards are binary rewards indicating whether the final observation (either when the agent generates the stop action or when it has used up all its moves) contains the ground-truth answer as a sub-string. Intuitively, because of the extractive nature of the question answerer module, the agent can answer the question correctly if and only if it achieves a 1.0 sufficient information reward on a given data point. We provide the sufficient information rewards of our agents in the full result tables, colored in blue.

Table 6: #Memory slot = 5. Testing F_1 in black and sufficient information rewards in blue. %RI represents the relative improvement over iMRC on the corresponding metric, across settings.