Learning and Reasoning for Robot Dialog and Navigation Tasks

Reinforcement learning and probabilistic reasoning algorithms aim at learning from interaction experiences and reasoning with probabilistic contextual knowledge respectively. In this research, we develop algorithms for robot task completions, while looking into the complementary strengths of reinforcement learning and probabilistic reasoning techniques. The robots learn from trial-and-error experiences to augment their declarative knowledge base, and the augmented knowledge can be used for speeding up the learning process in potentially different tasks. We have implemented and evaluated the developed algorithms using mobile robots conducting dialog and navigation tasks. From the results, we see that our robot’s performance can be improved by both reasoning with human knowledge and learning from task-completion experience. More interestingly, the robot was able to learn from navigation tasks to improve its dialog strategies.


Introduction
Knowledge representation and reasoning (KRR) and reinforcement learning (RL) are two important research areas in artificial intelligence (AI) and have been applied to a variety of problems in robotics. On the one hand, KRR research aims to concisely represent knowledge, and robustly draw conclusions with the knowledge (or generate new knowledge). Knowledge in KRR is typically provided by human experts in the form of declarative rules. Although KRR paradigms are strong in representing and reasoning with knowledge in a variety of forms, they are not designed for (and hence not good at) learning from experiences of accomplishing the tasks. On the other hand, RL algorithms enable agents to learn by interacting with an environment, and RL agents are good at learning action policies from trial-and-error experiences toward maximizing long-term rewards under uncertainty, but they are ill-equipped to utilize declarative knowledge from human experts. Motivated by the complementary features of KRR and RL, we aim at a framework that integrates both paradigms to enable agents (robots in our case) to simultaneously reason with declarative knowledge and learn by interacting with an environment.
Most KRR paradigms support the representation and reasoning of knowledge in logical form, e.g., Prolog-style. More recently, researchers have developed hybrid KRR paradigms that support both logical and probabilistic knowledge (Richardson and Domingos, 2006; Bach et al., 2017; Wang et al., 2019). Such logical-probabilistic KRR paradigms can be used for a variety of reasoning tasks. We use P-log (Baral et al., 2009) in this work to represent and reason with both human knowledge and the knowledge from RL. The reasoning results are then used by our robot to compute action policies at runtime.
Reinforcement learning (RL) algorithms can be used to help robots learn action policies from the experience of interacting with the real world (Sutton and Barto, 2018). We use model-based RL in this work, because the learned world model can be used to update the robot's declarative knowledge base and combined with human knowledge.
Theoretical Contribution: In this paper, we develop a learning and reasoning framework (called KRR-RL) that integrates logical-probabilistic KRR and model-based RL. The KRR component reasons with the qualitative knowledge from humans (e.g., it is difficult for a robot to navigate through a busy area) and the quantitative knowledge from model-based RL (e.g., a navigation action's success rate in the form of a probability). The hybrid knowledge is then used for computing action policies at runtime by planning with task-oriented partial world models. KRR-RL enables a robot to: i) represent the probabilistic knowledge (i.e., world dynamics) learned from RL in declarative form; ii) unify and reason with both human knowledge and the knowledge from RL; and iii) compute policies at runtime by dynamically constructing task-oriented partial world models.
Application Domain: We use a robot delivery domain for demonstration and evaluation purposes, where the robot needs to dialog with people to figure out the delivery task's goal location, and then physically take navigation actions to complete the delivery task Veloso, 2018). A delivery is deemed successful only if both the dialog and navigation subtasks are successfully conducted. We have conducted experiments using a simulated mobile robot, as well as demonstrated the system using a real mobile robot. Results show that the robot is able to learn world dynamics from navigation tasks through model-based RL, and apply the learned knowledge to both navigation tasks (with different goals) and delivery tasks (that require subtasks of navigation and dialog) through logical-probabilistic reasoning. In particular, we observed that the robot is able to adjust its dialog strategy through learning from navigation behaviors.

Related Work
Research areas related to this work include integrated logical KRR and RL, relational RL, and integrated KRR and probabilistic planning.
Logical KRR has previously been integrated with RL. Action knowledge (McDermott et al., 1998; Jiang et al., 2019) has been used to reason about action sequences and help an RL agent explore only the states that can potentially contribute to achieving the ultimate goal (Leonetti et al., 2016). As a result, their agents learn faster by avoiding choosing "unreasonable" actions. A similar idea has been applied to domains with nonstationary dynamics (Ferreira et al., 2017). More recently, task planning was used to interact with the high level of a hierarchical RL framework (Yang et al., 2018). The goal shared by these works is to enable RL agents to use knowledge to improve the performance in learning (e.g., to learn faster and/or avoid risky exploration). However, the KRR capabilities of these methods are limited to logical action knowledge. By contrast, we use a logical-probabilistic KRR paradigm that can directly reason with probabilities learned from RL.
Relational RL (RRL) combines RL with relational reasoning (Džeroski et al., 2001). Action models have been incorporated into RRL, resulting in a relational temporal difference learning method (Asgharbeygi et al., 2006). Recently, RRL has been deployed for learning affordance relations that forbid the execution of specific actions (Sridharan et al., 2017). These RRL methods, including deep RRL (Zambaldi et al., 2018), exploit structural representations over states and actions in (only) current tasks. In this research, KRR-RL supports the KRR of world factors beyond those in state and action representations, e.g., time in navigation tasks, as detailed in Section 4.2.
The research area of integrated KRR and probabilistic planning is related to this research. Logical-probabilistic reasoning has been used to compute informative priors and world dynamics Amiri et al., 2020) for probabilistic planning. An action language was used to compute a deterministic sequence of actions for robots, where individual actions are then implemented using probabilistic controllers (Sridharan et al., 2019).
Recently, human-provided information has been incorporated into belief state representations to guide robot action selection (Chitnis et al., 2018). In comparison to our approach, learning (from reinforcement or not) was not discussed in the abovementioned algorithms.
Finally, there are a number of robot reasoning and learning architectures (Tenorth and Beetz, 2013; Oh et al., 2015; Hanheide et al., 2017), which are relatively complex and support a variety of functionalities. In comparison, we aim at a concise representation for robot KRR and RL capabilities. To the best of our knowledge, this is the first work on a tightly coupled integration of logical-probabilistic KRR with model-based RL.

Background
We briefly describe the two most important building blocks of this research, namely model-based RL and hybrid KRR.

Model-based Reinforcement Learning
Following the Markov assumption, a Markov decision process (MDP) can be described as a four-tuple ⟨S, A, T, R⟩ (Puterman, 1994). S defines the state set, where we assume a factored space in this work. A is the action set. T : S × A × S → [0, 1] specifies the state transition probabilities. R : S × A → ℝ specifies the rewards. Solving an MDP produces an action policy π : s → a that maps a state to an action to maximize long-term rewards.
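As a concrete reading of the MDP definition above, the following minimal sketch solves a tiny MDP with value iteration. The dictionary-based encoding, the discount factor, and the two-state example (with all its numbers) are illustrative assumptions and are not taken from the experiments in this paper.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# T[(s, a)] maps next states to probabilities; R[(s, a)] is the immediate reward.
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
Rewards = Dict[Tuple[State, Action], float]


def q_value(s, a, values, T: Transitions, R: Rewards, gamma: float) -> float:
    """One-step lookahead value of taking action a in state s."""
    return R[(s, a)] + gamma * sum(p * values[s2] for s2, p in T[(s, a)].items())


def value_iteration(states: List[State], actions: List[Action],
                    T: Transitions, R: Rewards,
                    gamma: float = 0.95, tol: float = 1e-8):
    """Return (values, policy) for a finite MDP given as dictionaries."""
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, values, T, R, gamma) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions, key=lambda a: q_value(s, a, values, T, R, gamma))
              for s in states}
    return values, policy


# Toy two-state example; every number here is an illustrative assumption.
states = ["start", "goal"]
actions = ["move", "stay"]
T = {
    ("start", "move"): {"goal": 0.8, "start": 0.2},
    ("start", "stay"): {"start": 1.0},
    ("goal", "move"): {"goal": 1.0},
    ("goal", "stay"): {"goal": 1.0},
}
R = {
    ("start", "move"): -1.0,
    ("start", "stay"): 0.0,
    ("goal", "move"): 0.0,
    ("goal", "stay"): 1.0,
}
values, policy = value_iteration(states, actions, T, R)
```

In this toy example the greedy policy moves from "start" and stays at "goal"; policy iteration could be substituted without changing the interface.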
RL methods fall into classes including model-based and model-free. Model-based RL methods learn a model of the domain by approximating R(s, a) and P(s′ | s, a) for state-action pairs, where P represents the probabilistic transition system. An agent can then use planning methods to calculate an action policy (Sutton, 1990; Kocsis and Szepesvári, 2006). Model-based methods are particularly attractive in this work, because they output partial world models that can better accommodate the diversity of tasks we are concerned with, cf. model-free RL that is typically goal-directed.
One of the best known examples of model-based RL is R-Max (Brafman and Tennenholtz, 2002), which is guaranteed to learn a near-optimal policy with a polynomial number of suboptimal (exploratory) actions. The algorithm classifies each state-action pair as known or unknown, according to the number of times it was visited. When planning on the model, known state-actions are modeled with the learned reward, while unknown state-actions are given the maximum one-step reward, R_max. This "maximum-reward" strategy automatically enables the agent to balance the exploration of unknown states and exploitation. We use R-Max in this work, though KRR-RL practitioners can use supervised machine learning methods, e.g., imitation learning (Osa et al., 2018), to build the model learning component.

Logical Probabilistic KRR
KRR paradigms are concerned with concisely representing and robustly reasoning with declarative knowledge. Answer set programming (ASP) is a non-monotonic logical KRR paradigm (Baral, 2010; Gelfond and Kahl, 2014) building on the stable model semantics (Gelfond and Lifschitz, 1988). An ASP program consists of a set of logical rules, in the form of "head :- body", that read "head is true if body is true". Each ASP rule is of the form:
a or ... or b :- c, ..., d, not e, ..., not f.
where a...f are literals that correspond to true or false statements. Symbol not is a logical connective called default negation; not l is read as "it is not believed that l is true", which does not imply that l is false. ASP has a variety of applications (Erdem et al., 2016).
Traditionally, ASP does not explicitly quantify degrees of uncertainty: a literal is either true, false or unknown. P-log extends ASP to allow probability atoms (or pr-atoms) (Baral et al., 2009; Balai and Gelfond, 2017). The following pr-atom states that, if B holds, the probability of a(t)=y is v:
pr(a(t)=y | B) = v.
where B is a collection of literals or their default negations; a is a random variable; t is a vector of terms (a term is a constant or a variable); y is a term; and v ∈ [0, 1]. Reasoning with an ASP program generates a set of possible worlds: {W_0, W_1, ...}. The pr-atoms in P-log enable calculating a probability for each possible world. Therefore, P-log is a KRR paradigm that supports both logical and probabilistic inferences. We use P-log in this work for KRR purposes.

KRR-RL Framework
KRR-RL integrates logical-probabilistic KRR and model-based RL, and is illustrated in Figure 1. The KRR component includes both declarative qualitative knowledge from humans and the probabilistic knowledge from model-based RL. When the robot is free, the robot arbitrarily selects goals (different navigation goals in our case) to work on, and learns the world dynamics, e.g., success rates and costs of navigation actions. When a task becomes available, the KRR component dynamically constructs a partial world model (excluding unrelated factors), on which a task-oriented controller is computed using planning algorithms. Human knowledge concerns environment variables and their dependencies, i.e., what variables are related to each action. For instance, the human provides knowledge that navigation actions' success rates depend on current time and area (say elevator areas are busy in the mornings), while the robot must learn specific probabilities by interacting with the environment.
Why is KRR-RL needed? Consider an indoor robot navigation domain, where a robot wants to maximize the success rate of moving to goal positions through navigation actions. Shall we include factors, such as time, weather, positions of human walkers, etc., into the state space? On the one hand, to ensure model completeness, the answer should be "yes". Human walkers and sunlight (that blinds the robot's LiDAR sensors) reduce the success rates of the robot's navigation actions, and both can cause the robot to become irrecoverably lost. On the other hand, to ensure computational feasibility, the answer is "no". Modeling whether one specific grid cell is occupied by humans or not introduces one extra dimension in the state space, and doubles the state space size. If we consider (only) ten such grid cells, the state space becomes 2^10 = 1024 ≈ 1000 times bigger. As a result, RL practitioners frequently have to make a trade-off between model completeness and computational feasibility. In this work, we aim at a framework that retains both model scalability and computational feasibility, i.e., the agent is able to learn within relatively little memory while computing action policies accounting for a large number of domain variables.

A General Procedure
In factored spaces, state variables V = {V_0, V_1, ..., V_{n−1}} can be split into two categories, namely endogenous variables V^en and exogenous variables V^ex (Chermack, 2004), where V^en = {V^en_0, V^en_1, ..., V^en_{p−1}} and V^ex = {V^ex_0, V^ex_1, ..., V^ex_{q−1}}. In our integrated KRR-RL context, V^en is goal-oriented and includes the variables whose values the robot wants to actively change so as to achieve the goal; and V^ex corresponds to the variables whose values affect the robot's action outcomes, but the robot cannot (or does not want to) change their values. Therefore, V^en and V^ex both depend on task τ. Continuing the navigation example, robot position is an endogenous variable, and current time is an exogenous variable. For each task, V = V^en ∪ V^ex and n = p + q, and RL agents learn in spaces specified by V^en.
The KRR component models V, their dependencies from human knowledge, and conditional probabilities on how actions change their values, as learned through model-based RL. When a task arrives, the KRR component uses probabilistic rules to generate a task-oriented Markov decision process (MDP) (Puterman, 1994), which only contains a subset of V that are relevant to the current task, i.e., V^en, and their transition probabilities. Given this task-oriented MDP, a corresponding action policy is computed using value iteration or policy iteration.

Procedure 1 Learning in KRR-RL Framework
Require: Logical rules Π^L; probabilistic rules Π^P; random variables V = {V_0, V_1, ..., V_{n−1}}; task selector ∆; and guidance functions (from human knowledge) of f^V(V, τ) and f^A(τ)
1: while Robot has no task do
2:   τ ← ∆(): a task is heuristically selected
3:   Initialize agent: agent ← R-Max(M)
7:   RL agent repeatedly works on task τ, and keeps maintaining task model M′, until policy convergence
8: end while
9: Use M′ to update Π^P
Procedures 1 and 2 focus on how our KRR-RL agent learns by interacting with an environment when there is no task assigned [1]. Next, we present the details of these two interleaved processes.
Procedure 1 includes the steps of the learning process. When the robot is free, it interacts with the environment by heuristically selecting a task [2], and repeatedly using a model-based RL approach, R-Max (Brafman and Tennenholtz, 2002) in our case, to complete the task. The two guidance functions come from human knowledge. For instance, given a navigation task, it comes from human knowledge that the robot should model its own position (specified by f^V) and actions that help the robot move between positions (specified by f^A). After the policy converges or this learning process is interrupted (e.g., by task arrivals), the robot uses the learned probabilities to update the corresponding world dynamics in KRR. For instance, the robot may have learned the probability and cost of navigating through a particular area in early morning. In case this learning process is interrupted, the so-far "known" probabilities are used for knowledge base update.

[1] As soon as the robot's learning process is interrupted by the arrival of a real service task (identified via dialog), it will call Procedure 2 to generate a controller to complete the task. This process is not included in the procedures.
[2] Here curriculum learning in RL (Narvekar et al., 2017) can play a role in task selection, and we leave this aspect of the problem for future work.

Figure 2: Transition system specified for delivery tasks, where question-asking actions are used for estimating the service request in dialog. Once the robot becomes confident about the service request, it starts to work on the navigation subtask. After the robot arrives, the robot might have to come back to the dialog subtask and redeliver, depending on whether the service request was correctly identified.

Procedure 2 includes the steps for building the probabilistic transition system of MDPs. The key point is that we consider only endogenous variables in the task-specific state space. However, when reasoning to compute the transition probabilities (Line 5), the KRR component uses both Π^P and V^ex. The computed probabilistic transition systems are used for building task-oriented controllers, i.e., π, for task completions. In this way, the dynamically constructed controllers do not directly include exogenous variables, but their parameters already account for the values of all variables. Next, we demonstrate how our KRR-RL framework is instantiated on a real robot.
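The idea behind Procedure 2, a state space over endogenous variables only with transition parameters that depend on exogenous values, can be sketched roughly as follows. This is a simplified Python stand-in for the P-log reasoning step, not the authors' implementation: the lookup tables SUCCESS_PROB and NEXT_CELL, the cell and action names, and all probabilities are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical learned knowledge (stand-in for the probabilistic rules):
# success probability of moving into a cell, given the exogenous time of day.
SUCCESS_PROB: Dict[Tuple[str, str], float] = {
    ("corridor", "morning"): 0.6,
    ("corridor", "noon"): 0.9,
    ("office", "morning"): 0.95,
    ("office", "noon"): 0.95,
}

# Hypothetical logical knowledge (stand-in for the logical rules):
# the cell that each action targets from each cell.
NEXT_CELL: Dict[Tuple[str, str], str] = {
    ("shop", "go_corridor"): "corridor",
    ("corridor", "go_office"): "office",
}


def build_model(cells: List[str], actions: List[str],
                exogenous: Dict[str, str]):
    """Build M[(v, a)] = {v_next: prob} over the endogenous variable only.

    The exogenous assignment (here just "time") never appears in the state;
    it only selects which learned probabilities parameterize the model.
    """
    time = exogenous["time"]
    model = {}
    for v in cells:              # current value of the endogenous variable
        for a in actions:
            target = NEXT_CELL.get((v, a))
            if target is None:
                # Inapplicable action: the robot stays where it is.
                model[(v, a)] = {v: 1.0}
                continue
            p = SUCCESS_PROB[(target, time)]
            model[(v, a)] = {target: p, v: 1.0 - p}
    return model


cells = ["shop", "corridor", "office"]
actions = ["go_corridor", "go_office"]
morning_model = build_model(cells, actions, {"time": "morning"})
noon_model = build_model(cells, actions, {"time": "noon"})
```

Both models share the same three-cell state space; only their parameters differ, which is the property the paragraph above describes.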

An Instantiation on a Mobile Robot
We consider a mobile service robot domain where a robot can do navigation, dialog, and delivery tasks. A navigation task requires the robot to use a sequence of (unreliable) navigation actions to move from one point to another. In a dialog task, the robot uses spoken dialog actions to specify service requests from people under imperfect language understanding. There is a trend of integrating language and navigation in the NLP and CV communities (Chen et al., 2019; Shridhar et al., 2020). In this paper, they are integrated into delivery tasks that require the robot to use dialog to figure out the delivery request and conduct navigation tasks to physically fulfill the request. Specifically, a delivery task requires the robot to deliver item I to room R for person P, resulting in services in the form of <I,R,P>. The challenges come from unreliable human language understanding (e.g., speech recognition) and unforeseen obstacles that probabilistically block the robot in navigation.

Human-Robot Dialog
The robot needs spoken dialog to identify the request under unreliable language understanding, and navigation controllers for physically making the delivery.
The service request is not directly observable to the robot, and has to be estimated by asking questions, such as "What item do you want?" and "Is this delivery for Alice?" Once the robot is confident about the request, it takes a delivery action (i.e., serve(I,R,P)). We follow a standard way to use partially observable MDPs (POMDPs) (Kaelbling et al., 1998) to build our dialog manager, as reviewed in (Young et al., 2013). The state set S is specified using curr_s. The action set A is specified using serve and question-asking actions. Question-asking actions do not change the current state, and delivery actions lead to one of the terminal states (success or failure).
After the robot becomes confident about the request via dialog, it will take a delivery action serve(I,R,P). This delivery action is then implemented with a sequence of act_move actions. When the request identification is incorrect, the robot needs to come back to the shop, figure out the correct request, and redeliver, where we assume the robot will correctly identify the request in the second dialog. We use an MDP to model this robot navigation task, where the states and actions are specified using sorts cell and move. We use pr-atoms to represent the success rates of the unreliable movements, which are learned through model-based RL. The dialog system builds on our previous work (Lu et al., 2017). Figure 2 shows the probabilistic transitions in delivery tasks.

Procedure 2 Model Construction for Task Completion
3:     for each a ∈ A do
4:       for each possible value v′ in range(V_i) do
5:         M(v′ | a, v) ← Reason with Π^L and Π^P w.r.t. V^ex
6:       end for
7:     end for
8:   end for
9: end for
10: return M
Learning from Navigation: We use R-Max (Brafman and Tennenholtz, 2002), a model-based RL algorithm, to help our robot learn the success rate of navigation actions in different positions. The agent first initializes an MDP, from which it uses R-Max to learn the partial world model (of navigation tasks). Specifically, it initializes the transition function with T_N(s, a, s_v) = 1.0, where s ∈ S and a ∈ A, meaning that starting from any state, after any action, the next state is always s_v. The reward function is initialized with R(s, a) = R_max, where R_max is an upper bound of reward. The initialization of T_N and R enables the learner to automatically balance exploration and exploitation. There is a fixed small cost for each navigation action. The robot receives a big bonus if it successfully achieves the goal (R_max), whereas it receives a big penalty otherwise (−R_max). A transition probability in navigation, T_N(s, a, s′), is not computed until there are a minimum number (M) of transition samples visiting s′. We recompute the action policy after every E action steps.
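The optimistic initialization described above can be sketched as a small model class. This is a minimal illustration under stated assumptions, not the authors' code: the class and method names are hypothetical, and a state-action pair is treated as "known" once it has been tried min_samples times, which is the textbook R-Max criterion and may differ in detail from the per-next-state threshold used in the paper.

```python
from collections import defaultdict


class RMaxModel:
    """Count-based world model with R-Max style optimism (illustrative)."""

    def __init__(self, r_max: float, min_samples: int, virtual_state: str = "s_v"):
        self.r_max = r_max
        self.min_samples = min_samples      # plays the role of M in the text
        self.virtual_state = virtual_state  # plays the role of s_v in the text
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> s' -> n
        self.reward_sum = defaultdict(float)                 # (s, a) -> sum of r

    def update(self, s, a, r, s_next):
        """Record one observed transition (s, a, r, s_next)."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def visits(self, s, a) -> int:
        return sum(self.counts[(s, a)].values())

    def is_known(self, s, a) -> bool:
        return self.visits(s, a) >= self.min_samples

    def transition(self, s, a):
        """Unknown pairs lead to the virtual state with probability 1.0."""
        if not self.is_known(s, a):
            return {self.virtual_state: 1.0}
        n = self.visits(s, a)
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}

    def reward(self, s, a) -> float:
        """Unknown pairs receive the optimistic reward r_max."""
        if not self.is_known(s, a):
            return self.r_max
        return self.reward_sum[(s, a)] / self.visits(s, a)


# Usage with made-up numbers: four attempts of one navigation action.
model = RMaxModel(r_max=80.0, min_samples=4)
before = (model.transition("c1", "east"), model.reward("c1", "east"))
for s_next in ["c2", "c2", "c2", "c1"]:
    model.update("c1", "east", -1.0, s_next)
after = (model.transition("c1", "east"), model.reward("c1", "east"))
```

A planner run on this model is drawn toward unknown pairs because they look maximally rewarding, which is how exploration and exploitation are balanced without a separate exploration schedule.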

Dialog-Navigation Connection
The update of the knowledge base is achieved through updating the success rate of delivery actions serve(I,R,P) (in the dialog task) using the success rate of navigation actions act_move=M in different positions.

\[
T_D(s_r, a_d, s_t) =
\begin{cases}
P_N(s_{sp}, s_{gl}), & \text{if } s_r \odot a_d \\
P_N(s_{sp}, s_{mi}) \times P_N(s_{mi}, s_{sp}) \times P_N(s_{sp}, s_{gl}), & \text{if } s_r \otimes a_d
\end{cases}
\]

where T_D(s_r, a_d, s_t) is the probability of fulfilling request s_r using delivery action a_d; s_t is the "success" terminal state; s_sp, s_mi and s_gl are states of the robot being in the shop, a misidentified goal position, and the real goal position respectively; and P_N(s, s′) is the probability of the robot successfully navigating from position s to position s′. When s_r and a_d are aligned in all three dimensions (i.e., s_r ⊙ a_d), the robot needs to navigate once from the shop (s_sp) to the requested navigation goal (s_gl). P_N(s_sp, s_gl) is the probability of the corresponding navigation task. When the request and delivery action are not aligned in at least one dimension (i.e., s_r ⊗ a_d), the robot has to navigate back to the shop to figure out the correct request, and then redeliver, resulting in three navigation tasks. Intuitively, the penalty of failures in a dialog subtask depends on the difficulty of the wrongly identified navigation subtask. For instance, a robot supposed to deliver to a near (distant) location being wrongly directed to a distant (near) location, due to a failure in the dialog subtask, will produce a higher (lower) penalty to the dialog agent.
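The two cases of the delivery success probability can be written out as a short function. This is a sketch under stated assumptions: requests and actions are encoded as (item, room, person) tuples, the navigation success probabilities are given as a hypothetical lookup table p_nav keyed by (from, to), and all numbers are made up for illustration.

```python
from typing import Dict, Tuple

Request = Tuple[str, str, str]  # (item, room, person)


def delivery_success_prob(request: Request, action: Request,
                          p_nav: Dict[Tuple[str, str], float],
                          shop: str = "shop") -> float:
    """Probability of fulfilling `request` when executing delivery `action`."""
    goal_room = request[1]
    if request == action:
        # Aligned in all three dimensions: one trip from the shop to the goal.
        return p_nav[(shop, goal_room)]
    # Misaligned in at least one dimension: go to the misidentified position,
    # return to the shop, then navigate to the real goal.
    wrong_room = action[1]
    return (p_nav[(shop, wrong_room)]
            * p_nav[(wrong_room, shop)]
            * p_nav[(shop, goal_room)])


# Hypothetical learned navigation success rates.
p_nav = {
    ("shop", "office1"): 0.9, ("office1", "shop"): 0.9,
    ("shop", "office2"): 0.5, ("office2", "shop"): 0.5,
}
request = ("coke", "office1", "Bob")
aligned = delivery_success_prob(request, ("coke", "office1", "Bob"), p_nav)
misaligned = delivery_success_prob(request, ("coke", "office2", "Bob"), p_nav)
```

With these numbers, a misidentified room that is hard to reach (office2) drives the fulfillment probability from 0.9 down to about 0.225, which illustrates why the dialog agent is penalized more when an error sends the robot through a difficult navigation task.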

Experiments
In this section, the goal is to evaluate our hypothesis that our KRR-RL framework enables a robot to learn from model-based RL, reason with both the learned knowledge and human knowledge, and dynamically construct task-oriented controllers. Specifically, our robot learns from navigation tasks, and applies the learned knowledge (through KRR) to navigation, dialog, and delivery tasks.
We also evaluated whether the learned knowledge can be represented and applied to tasks under different world settings. In addition to simulation experiments, we have used a real robot to demonstrate how our robot learns from navigation to perform better in dialog. Figure 3 shows the map of the working environment (generated using a real robot) used in both simulation and real-robot experiments. Human walkers in the blocking areas ("BA") can probabilistically impede the robot, resulting in different success rates in navigation tasks.
We have implemented our KRR-RL framework on a mobile robot in an office environment. As shown in Figure 3, the robot is equipped with two Lidar sensors for localization and obstacle avoidance in navigation, and a Kinect RGB-D camera for human-robot interaction. We use the Speech Application Programming Interface (SAPI) package (http://www.iflytek.com/en) for speech recognition. The robot software runs in the Robot Operating System (ROS) (Quigley et al., 2009).

Figure 4: ... decided to confirm the item, considering its unreliable language understanding capability; (c) After hearing "coke", the robot became more confident about the item, and decided to ask again about the goal room; (d) After hearing "office2", the robot became confident about the whole request, and started to work on the task; (e) Robot was on the way to the kitchen to pick up the object; and (f) Robot arrived at the kitchen, and was going to pick up the object for delivery.
An Illustrative Trial on a Robot: Figure 4 shows the screenshots of milestones of a demo video, which will be made available given its acceptance. After hearing "a coke for Bob to office2", the three sub-beliefs are updated (turn1). Since the robot is aware of its unreliable speech recognition, it asked about the item, "Which item is it?" After hearing "a coke", the belief is updated (turn2), and the robot further confirmed the item by asking "Should I deliver a coke?" It received a positive response (turn3), and decided to move on to ask about the delivery room: "Should I deliver to office 2?" After this question, the robot did not further confirm the delivery room, because it had learned through model-based RL that navigating to office2 is relatively easy, and it decided that risking an error and having to replan was preferable to asking the person another question. The robot became confident in three dimensions of the service request (<coke,Bob,office2> in turn4) without asking about the person, because of the prior knowledge (encoded in P-log) about Bob's office. Figure 5 shows the belief changes (in the dimensions of item, person, and room) as the robot interacts with a human user. The robot started with a uniform distribution in all three categories. It should be noted that, although the marginal distributions are uniform, the joint belief distribution is not, as the robot has prior knowledge such as Bob's office is office2 and people prefer deliveries to their own offices. Demo video is not included to respect the anonymous review process.
Learning to Navigate from Navigation Tasks: In this experiment, the robot learns in the shop-room1 navigation task, and applies the extracted partial world model to the shop-room2 task. It should be noted that navigation from shop to room2 requires traveling in areas that are unnecessary in the shop-room1 task. Figure 6 presents the results, where each data point corresponds to an average of 1000 trials. Each episode allows at most 200 (300) steps in the small (large) domain. The curves are smoothed using a window of 10 episodes. The results suggest that with knowledge extraction (the dashed line) the robot learns faster than without extraction, and this performance improvement is more significant in the larger domain (the right subfigure).
Learning to Dialog and Navigate from Navigation Tasks: Delivering objects requires both capabilities: dialog management for specifying the service request (under unreliable speech recognition) and navigation for physically delivering objects (under unforeseen obstacles). Our office domain includes five rooms, two persons, and three items, resulting in 30 possible service requests. In the dialog manager, the reward function gives delivery actions a big bonus (80) if a request is fulfilled, and a big penalty (-80) otherwise. General questions and confirming questions cost 2.0 and 1.5 respectively. In case a dialog does not end after 20 turns, the robot is forced to work on the most likely delivery. The cost/bonus/penalty values are heuristically set in this work, following guidelines based on studies from the literature on dialog agent behaviors (Zhang and Stone, 2015).
Table 1 reports the robot's overall performance in delivery tasks, which require accurate dialog for identifying delivery tasks and safe navigation for object delivery. We conduct 10,000 simulation trials under each blocking rate. Without learning from RL, the robot uses an outdated world model that was learned under br = 0.3. With learning, the robot updates its world model in domains with different blocking rates. We can see that, when learning is enabled, our KRR-RL framework produces higher overall reward, higher request fulfillment rate, and lower question-asking cost. The improvement is statistically significant, i.e., the p-values are 0.028, 0.035, and 0.049 for overall reward, when br is 0.1, 0.5, and 0.7 respectively (100 randomly selected trials with/without extraction).
Learning to Adjust Dialog Strategies from Navigation: In the last experiment, we quantify the information collected in dialog in terms of entropy reduction. The hypothesis is that, using our KRR-RL framework, the dialog manager wants to collect more information before physically working on more challenging tasks. In each trial, we randomly generate a belief distribution over all possible service requests, evaluate the entropy of this belief, and record the suggested action given this belief. We then statistically analyze the entropy values of beliefs under which delivery actions are suggested.
Table 2 shows that, when br grows from 0.1 to 0.7, the mean belief entropy decreases (i.e., the belief is more converged). This suggests that the robot collected more information in dialog in environments that are more challenging for navigation, which is consistent with Table 1. Comparing the three columns of results, we find the robot collects the most information before it delivers to room5. This is because such delivery tasks are the most difficult due to the location of room5. The results support our hypothesis that learning from navigation tasks enables the robot to adjust its information gathering strategy in dialog given tasks of different difficulties.
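The entropy measurement used in this experiment can be illustrated with a short sketch. The paper does not state the logarithm base or how random beliefs are sampled, so the base-2 logarithm and the normalize-uniform-weights sampler below are assumptions made only for illustration.

```python
import math
import random
from typing import List


def entropy(belief: List[float]) -> float:
    """Shannon entropy (in bits) of a discrete belief distribution."""
    return -sum(p * math.log2(p) for p in belief if p > 0.0)


def random_belief(n: int, rng: random.Random) -> List[float]:
    """Sample a belief over n requests by normalizing uniform weights.

    This sampler is an assumption; other choices (e.g., Dirichlet sampling)
    would give a different spread of entropy values.
    """
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


NUM_REQUESTS = 30  # five rooms x two persons x three items, as in the text
rng = random.Random(0)
belief = random_belief(NUM_REQUESTS, rng)
h = entropy(belief)
h_uniform = entropy([1.0 / NUM_REQUESTS] * NUM_REQUESTS)
```

A lower entropy at the moment a delivery action is suggested means the belief was more concentrated, i.e., more information had been gathered through dialog before the robot committed to navigating.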

Adaptive Control in New Circumstances
The knowledge learned through model-based RL is added to a knowledge base that can be used for many tasks. Our KRR-RL framework therefore enables a robot to dynamically generate partial world models for tasks under settings that were never experienced. For example, suppose an agent does not know whether the current time is morning or noon, so there are two possible values for the variable "time". Consider that our agent has learned world dynamics under the times of morning and noon. Our KRR-RL framework enables the robot to reason about the two transition systems under the two settings and generate a new transition system for this "morning-or-noon" setting. Without our framework, an agent would have to randomly select one of the "morning" and "noon" policies.
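One plausible reading of generating a transition system for the "morning-or-noon" setting is to marginalize the learned transition probabilities over the unknown exogenous value. The sketch below assumes this interpretation and a uniform distribution over the two times; in the paper the combination is carried out by logical-probabilistic reasoning, and the function name and all numbers here are hypothetical.

```python
from typing import Dict, Tuple

Model = Dict[Tuple[str, str], Dict[str, float]]  # (s, a) -> {s_next: prob}


def merge_models(models: Dict[str, Model],
                 setting_probs: Dict[str, float]) -> Model:
    """Mix per-setting transition models, weighted by setting probability."""
    merged: Model = {}
    for setting, model in models.items():
        weight = setting_probs[setting]
        for state_action, dist in model.items():
            out = merged.setdefault(state_action, {})
            for s_next, p in dist.items():
                out[s_next] = out.get(s_next, 0.0) + weight * p
    return merged


# Hypothetical models learned separately under two known settings.
models = {
    "morning": {("shop", "move"): {"office": 0.6, "shop": 0.4}},
    "noon": {("shop", "move"): {"office": 0.9, "shop": 0.1}},
}
# The agent does not know which setting holds; assume both are equally likely.
merged = merge_models(models, {"morning": 0.5, "noon": 0.5})
```

Planning on the merged model yields a single policy that accounts for both possibilities, in contrast to picking the "morning" or "noon" policy at random.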
To evaluate our policies dynamically constructed via KRR, we let an agent learn three controllers under three different environment settings, where the navigation actions have decreasing success rates across the settings. In this experiment, the robot does not know which setting it is in (out of two that are randomly selected). The baseline does not have the KRR capability of merging knowledge learned from different settings, and can only randomly select a policy from the two (each corresponding to a setting). Experimental results show that the baseline agent achieved an average of 26.8% success rate in navigation tasks, whereas our KRR-RL agent achieved 83.8% success rate on average. Figure 7 shows the costs in a box plot (including min-max, 25%, and 75% values). Thus, KRR-RL enables a robot to effectively apply the learned knowledge to tasks under new settings.
Let us take a closer look at the "time" variable T. If 𝒯 is the domain of T, the RL-only baseline has to compute a total of 2^|𝒯| world models to account for all possible information about the value of T, where 2^|𝒯| is the number of subsets of 𝒯. If there are N such variables, the number of world models grows exponentially to 2^(|𝒯|·N). In comparison, the KRR-RL agent needs to compute only |𝒯|^N world models, which dramatically reduces the number of parameters that must be learned through RL while retaining policy quality.
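The two counts in the paragraph above can be compared numerically. The sketch below simply evaluates the stated formulas; the function names are hypothetical, and the assumption that all N variables share the same domain size follows the text.

```python
def num_models_rl_only(domain_size: int, num_vars: int) -> int:
    """World models needed by the RL-only baseline: 2^(domain_size * N)."""
    return 2 ** (domain_size * num_vars)


def num_models_krr_rl(domain_size: int, num_vars: int) -> int:
    """World models needed by the KRR-RL agent: domain_size^N."""
    return domain_size ** num_vars


# Example: N exogenous variables with three values each (made-up sizes).
for n in (1, 2, 4):
    print(n, num_models_rl_only(3, n), num_models_krr_rl(3, n))
```

For instance, with four three-valued variables the baseline would need 4096 world models while the KRR-RL agent would need 81.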

Conclusions and Future Work
We develop a KRR-RL framework that integrates computational paradigms of logical-probabilistic knowledge representation and reasoning (KRR), and model-based reinforcement learning (RL). Our KRR-RL agent learns world dynamics via model-based RL, and then incorporates the learned dynamics into the logical-probabilistic reasoning module, which is used for dynamic construction of efficient run-time task-specific planning models. Experiments were conducted using a mobile robot (simulated and physical) working on delivery tasks that involve both navigation and dialog. Results suggested that the learned knowledge from RL can be represented and used for reasoning by the KRR component, enabling the robot to dynamically generate task-oriented action policies.
The integration of a KRR paradigm and model-based RL paves the way for at least the following research directions. We plan to study how to sequence source tasks to help the robot perform the best in the target task (i.e., a curriculum learning problem within the RL context (Narvekar et al., 2017)). Balancing the efficiencies between service task completion and RL is another topic for further study: currently the robot optimizes for task completions (without considering the potential knowledge learned in this process) once a task becomes available. Fundamentally, all domain variables are endogenous, because one can hardly find variables whose values are completely independent from robot actions. However, for practical reasons (such as limited computational resources), people have to limit the number of endogenous variables. It remains an open question how to decide which variables should be considered endogenous.