Symbol Grounding and Task Learning from Imperfect Corrections

This paper describes a method for learning from a teacher’s potentially unreliable corrective feedback in an interactive task learning setting. The graphical model uses discourse coherence to jointly learn symbol grounding, domain concepts and valid plans. Our experiments show that the agent learns its domain-level task in spite of the teacher’s mistakes.


Introduction
Interactive Task Learning (ITL) aims to develop agents that can learn arbitrary new tasks through a combination of their own actions in the environment and an ongoing interaction with a teacher (see Laird et al. (2017) for a survey). Because the agent continues to learn after deployment, ITL allows an agent to learn in an ever-changing environment in a natural manner.
One goal of ITL is to make the interaction as natural as possible for a human teacher, and many modes of interaction have been studied: non-verbal interaction through demonstration or teleoperation (Argall et al., 2009), and natural language interaction in the form of an embodied, extended dialogue between teacher and agent, much like that between a teacher and an apprentice. Our interest lies in natural language interactions where teachers can provide instructions (She et al., 2014), describe current states (Hristov et al., 2017) and define concepts (Scheutz et al., 2017), goals (Kirk and Laird, 2019), and actions (She et al., 2014), while the agent asks clarifying questions (She and Chai, 2017) and executes instructed commands. Teachers can also use corrective feedback (Appelgren and Lascarides, 2020). These approaches all assume that the teacher offers information that is both correct and timely. However, humans are error prone, and so in this paper we study how agents can learn successfully from corrective feedback even when the teacher makes mistakes.
Appelgren and Lascarides' (2020) model exploits discourse coherence (Hobbs, 1985; Kehler, 2002; Asher and Lascarides, 2003): that is, constraints on how a current move relates to its context. But their model assumes that the teacher perfectly follows a specific dialogue strategy: she corrects a mistake as and when the agent makes it. However, humans may fail to perceive mistakes when they occur. As a result, they may utter a correction much later than when the agent made the mistake. And because the teacher may be confident, but wrong, about the agent's capacity to ground NL descriptions to their referents, the agent may miscalculate which salient part of the context the teacher is correcting. In this paper, we present and evaluate an ITL model that copes with such errors.
In §2, we use prior work to motivate the task we tackle, as described in §3. We present our ITL model in §4 and §5, focusing on coping with situations where the teacher makes mistakes of the type we just described. We show in §6 that by making the model separate the appearance that the teacher's utterance coherently connects to the agent's latest action with the chance that it is not so connected, our agent can still learn its domain-level task effectively.

Background
Interactive Task Learning (ITL) exploits interaction to support autonomous decision making during planning (Laird et al., 2017). Similar to Kirk and Laird (2019), our aim is to provide the agent with information about goals, actions, and concepts that allow it to construct a formal representation of the decision problem, which can thereafter be solved with standard decision making algorithms. Rather than teaching a specific sequence of actions (as in, e.g., Nicolescu and Mataric (2001); She et al. (2014)), the teacher provides the information needed to infer a valid plan for a goal in a range of specific situations. In this work we focus on learning goals, which express constraints on a final state. The agent learns these goals by receiving corrective dialogue moves that highlight an aspect of the goal which the agent has violated (Appelgren and Lascarides, 2020).
Natural language (NL) can make ITL more data efficient than non-verbal demonstration alone: even simple yes/no feedback can be used to learn a reward function (Knox and Stone, 2009) or to trigger specific algorithms for improving behaviour (Nicolescu and Mataric, 2003). More extended NL phrases must map to semantic representations or logical forms that support inference (e.g., Wang et al., 2016; Zettlemoyer and Collins, 2007). Like prior ITL systems (e.g., Forbes et al., 2015; Lauria et al., 2002), we assume our agent can analyse sentential syntax, restricting the possible logical forms to a finite set. But disambiguated syntax does not resolve semantic scope ambiguities or lexical senses (Copestake et al., 1999), and so the agent must use context to identify which logical form matches the speaker's intended meaning.
Recovering from misunderstandings has been addressed in dialogue systems (eg, Skantze, 2007), and ITL systems cope with incorrect estimates of denotations of NL descriptions (eg, Part and Lemon, 2019). Here, we address new sources of misunderstanding that stem quite naturally from the teacher attempting, but failing, to abide by a particular dialogue strategy: ie, to correct the agent's mistakes as and when they're made. This can lead to the learner misinterpreting the teacher's silence (silence might not mean the latest action was correct) or misinterpreting which action is being corrected (it might be an earlier action than the agent's latest one). We propose a model that copes with this uncertainty.

Task
In our task an agent must build towers in a blocks world. The agent begins knowing two PDDL action descriptions: put(x, y) for putting an object x on another y; and unstack(x, y) for removing an object x from another object y and placing x back on the table. Further, it knows the initial state consists of 10 individual blocks that are clear (i.e., nothing on them) and on the table, and that the goal G contains the fact that all the 10 blocks must be in a tower.

Figure 1: The colours of objects fit into different colour terms. Each individual hue is generated from a Gaussian distribution, with mean and variance selected to produce hues described by the chosen colour term. There are high level categories like "red" and "green" and more specific ones like "maroon". This figure shows examples of hues generated in each category, including one that is both red and maroon.
However, putting the blocks in a tower is only a partial description of the true planning problem, and the agent lacks vital knowledge about the problem in the following ways. First, the true goal G includes further constraints (e.g., that each red block must be on a blue block) and the agent is unaware of which such constraints are truly in the goal. Further, and perhaps more fundamentally, the agent is also unaware of the colour terms used to define the constraints. That is, the word "red" is not a part of the agent's natural language vocabulary, and so the agent does not know what "red" means or what particular set of RGB values the word denotes. Instead, the agent can only observe the RGB value of an object and must learn to recognise the colour through interaction with the teacher, in particular through the corrective dialogue moves that the teacher utters.
The possible goal constraints are represented in equations (1-2), where $C_1$ and $C_2$ are colour terms, e.g., "red" ($r$ for short) and "blue" ($b$):

$$r_1^{C_1,C_2} = \forall x.\, C_1(x) \rightarrow \exists y.\, C_2(y) \wedge on(x, y) \quad (1)$$

$$r_2^{C_1,C_2} = \forall y.\, C_2(y) \rightarrow \exists x.\, C_1(x) \wedge on(x, y) \quad (2)$$

In words, $r_1^{r,b}$ expresses that every red block must be on a blue block; $r_2^{r,b}$ that every blue block should have a red one on it. These rules constrain the final tower, but thanks to the available actions, if a constraint is violated by a put action then it remains violated in all subsequent states unless that put action is undone by unstack.
In our experiments (see §6), a simulated teacher observes the agent attempting to build the tower, and when the agent executes an action that breaks one or more of the rules in the goal G, the teacher provides NL feedback, e.g., "no, red blocks should be on blue blocks". The feedback corrects the agent's action and provides an explanation as to why it was incorrect. However, linguistic syntax makes the sentence ambiguous between two rules: "red blocks should be on blue blocks" could mean $r_1^{r,b}$ or $r_2^{r,b}$. Thus, the agent must disambiguate the teacher's message while simultaneously learning to ground new terms in the embodied environment, in this example the terms "red" and "blue". This latter task amounts to learning which RGB values are members of which colour concepts (see Figure 1).
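To make the two readings of "red blocks should be on blue blocks" concrete, the following minimal sketch checks each rule against a finished tower. It is an illustration only: the function names are hypothetical, the tower is simplified to a bottom-to-top list of single colour labels (the paper allows overlapping colour terms such as red and maroon), and nothing here is taken from the authors' implementation.

```python
def satisfies_r1(tower, c1, c2):
    """Rule r1: every c1 block is on a c2 block.

    `tower` lists one colour label per block, bottom to top (a simplifying
    assumption: the paper's colour terms may overlap).
    """
    for i, colour in enumerate(tower):
        if colour == c1:
            # The bottom block rests on the table, not on a c2 block.
            if i == 0 or tower[i - 1] != c2:
                return False
    return True


def satisfies_r2(tower, c1, c2):
    """Rule r2: every c2 block has a c1 block on it."""
    for i, colour in enumerate(tower):
        if colour == c2:
            # The top block has nothing on it.
            if i == len(tower) - 1 or tower[i + 1] != c1:
                return False
    return True
```

For instance, the tower blue, red, blue (bottom to top) satisfies the first reading but not the second, because the top blue block has no red block on it.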

Coherence
To learn from the teacher's feedback, the agent reasons about how an utterance is coherent. In discourse, each utterance must connect to a previous part of the discourse through a coherence relation, and the relation that connects the two tells us what the utterance contributes to the discourse. In our multimodal discourse, each of the teacher's utterances $u$ connects to one of the agent's actions $a$ through the discourse relation "correction". The semantics of correction stipulate that the content of the correction is true and negates some part of the corrected action (Asher and Lascarides, 2003). In our domain, this means that the teacher will utter $u$ if the agent's latest action $a$ violates the rule that she intended $u$ to express. If $u$ = "no, red blocks should be on blue blocks" then, as previously stated, this is ambiguous between $r_1^{r,b}$ and $r_2^{r,b}$. So, $a$ must violate one of these two rules:

$$CC(a, u) \leftrightarrow (r_1^{r,b} \in G \wedge V(r_1^{r,b}, a)) \vee (r_2^{r,b} \in G \wedge V(r_2^{r,b}, a)) \quad (3)$$

where $CC(a, u)$ represents that $u$ coherently corrects action $a$, $G$ is the (true) goal, and $V(r_1^{r,b}, a)$ represents that $a$ violated $r_1^{r,b}$ (similarly for $V(r_2^{r,b}, a)$). Since the semantics of correction is satisfied only if the correction is true, the rule the speaker intended to express must also be part of the true goal $G$; that is why (3) features $r_1^{r,b} \in G$ (and $r_2^{r,b} \in G$) in the two disjuncts.

There are two ways in which these rules can be violated: directly or indirectly. The rule $r_1^{r,b}$ requires every red block to be on a blue block, so it is directly violated by an action $a = put(o_1, o_2)$ that places a red block $o_1$ on a block $o_2$ that is not blue ($S_1$ in Figure 2):

$$V_D(r_1^{r,b}, a) \leftrightarrow red(o_1) \wedge \neg blue(o_2) \quad (4)$$

The rule $r_2^{r,b}$ requires all blue blocks to have red blocks on them, meaning that $S_1$ in Figure 2 does not directly violate the rule, but $S_2$ does, because the rule is only violated when a blue block does not have a red block on it:

$$V_D(r_2^{r,b}, a) \leftrightarrow \neg red(o_1) \wedge blue(o_2) \quad (5)$$

So $r_1^{r,b}$ is not directly violated in $S_2$ and $r_2^{r,b}$ is not directly violated in $S_1$, but it would still be impossible to complete a rule compliant tower without undoing the progress that has been made on tower building.
This is because the block which is currently not in the tower cannot be placed into the current tower in a rule compliant manner. For $r_1^{r,b}$ in $S_2$, the red block needs a blue block to be placed on, but no such blue block exists. Similarly, for $r_2^{r,b}$ in $S_1$, the blue block needs a red block to be placed on it, but no additional red blocks are available. In this way the rules are indirectly violated in these states: an indirect violation occurs when the number of available blocks of each colour makes it impossible to place all of those blocks. Our teacher signals whether the error is due to a direct violation $V_D$, by pointing at the tower, or an indirect violation $V_I$, by pointing at a block which cannot be placed in the tower any more (e.g., the blue block in $S_1$ or the red block in $S_2$).
When the agent observes the teacher say u = "no, put red blocks on blue blocks" it can make inferences about the world, with confidence in those inferences depending on its current knowledge. For example, if it knows with confidence which blocks are "red" or "blue", then it can infer via equations (4-7) which of the rules the teacher intended to convey. Alternatively, if the agent knows which rule was violated then it can infer the colour of the blocks. We use this in §5 to learn the task. However, if the agent is completely ignorant about the referents of the colour terms, it may not be able to make an inference at all. In this case it will ask for help by asking about the colour of one of the blocks: "is the top block red?". Either answer to this question is sufficient for the agent to disambiguate the intended message and also to gain training exemplars (both positive and negative) for grounding the relevant colour terms.
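The inference from known colours to the intended rule can be sketched by enumerating which goal hypotheses make a correction coherent. This is a hedged illustration, not the authors' code: the function names are hypothetical, only direct violations are covered, and the violation conditions are our reading of the prose (a red block on a non-blue block for the first rule, a non-red block on a blue block for the second).

```python
from itertools import product


def directly_violates_r1(o1_is_red, o2_is_blue):
    # Assumed condition: a red block was put on a block that is not blue.
    return o1_is_red and not o2_is_blue


def directly_violates_r2(o1_is_red, o2_is_blue):
    # Assumed condition: a non-red block was put on a blue block.
    return (not o1_is_red) and o2_is_blue


def coherent_correction(r1_in_goal, v1, r2_in_goal, v2):
    # A correction is coherent when some rule is both in the goal and violated.
    return (r1_in_goal and v1) or (r2_in_goal and v2)


def consistent_goals(o1_is_red, o2_is_blue):
    """Goal hypotheses (r1 in G, r2 in G) under which correcting put(o1, o2) is coherent."""
    v1 = directly_violates_r1(o1_is_red, o2_is_blue)
    v2 = directly_violates_r2(o1_is_red, o2_is_blue)
    return [
        (g1, g2)
        for g1, g2 in product([True, False], repeat=2)
        if coherent_correction(g1, v1, g2, v2)
    ]
```

If the agent is sure that a red block was placed on a non-blue block, every surviving hypothesis contains the first rule, so the teacher must have intended that reading. If neither violation condition holds, no hypothesis survives, which is the situation in which a correction cannot be a coherent response to a direct violation by that action.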
If the teacher's dialogue strategy is to always correct an action which violates a rule (either directly or indirectly) directly after this incorrect action is executed, then the teacher's silence implies that the latest executed action does not violate a rule. For example, if a green block is placed on a blue block and the teacher stays silent, the agent can infer that either green blocks must always be placed on blue blocks ($r_1^{g,b}$) or no rule constraining green blocks exists. In this way the teacher's silence implies assent.

Faulty Teacher
We've laid out what it means for a correction to be coherent, assuming that the teacher always acts optimally, correcting any action which violates a rule as soon as that action is performed. However, a human teacher is unlikely to follow this strategy perfectly. Even so, the agent must still attempt to learn from the teacher's utterances, although some of those utterances may not fit with the agent's expectations and understanding of coherence. In this paper we introduce two types of errors the teacher can make: (a) she fails to utter a correction when the latest action a violates a rule; and (b) she utters a correction when the most recent action does not violate a rule (perhaps because she notices a previous action she should have corrected). We think of (b) as adding a correction at the 'wrong' time.
Since an action can violate a rule in two ways, directly or indirectly, each of these two errors can concern either a direct or an indirect violation, giving four types of teacher error in total. In our experiments we control in what way the teacher is faulty by assigning a probability with which the teacher performs each type of mistake, e.g., $P_{MD}$ represents the probability that the teacher misses a direct violation. Controlling these probabilities allows us to create different types of faulty teachers.
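A simulated teacher of this kind could be sketched as below. This is an assumption-laden illustration rather than the teacher used in the experiments: the class and field names are hypothetical, each error is modelled as an independent coin flip, and added corrections are simplified to a single probability applied whenever the latest action violates nothing.

```python
import random
from dataclasses import dataclass


@dataclass
class FaultyTeacher:
    """Hypothetical simulated teacher with per-error-type probabilities."""
    p_miss_direct: float = 0.0    # plays the role of P_MD
    p_miss_indirect: float = 0.0  # plays the role of P_MI
    p_add_direct: float = 0.0     # plays the role of P_AD

    def corrects(self, violation, rng):
        """Return True if the teacher utters a correction.

        `violation` is "direct", "indirect", or None (no rule violated).
        """
        if violation == "direct":
            return rng.random() >= self.p_miss_direct
        if violation == "indirect":
            return rng.random() >= self.p_miss_indirect
        # No violation: a correction here is one added at the 'wrong' time.
        return rng.random() < self.p_add_direct
```

With all probabilities at zero this reduces to the faultless teacher, who corrects exactly the violating actions; setting `p_miss_indirect` to 1.0 gives a teacher who never corrects indirect violations but remains reliable on direct ones.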
Due to the teacher's faultiness the agent must now reason about whether or not it should update its knowledge of the world given a teacher utterance or silence. In the following section we describe how we deal with this by creating graphical models which capture the semantics of coherence as laid out in this section.

System Description
An agent for learning from correction to perform the task described in §3 must be able to update its knowledge given the corrective feedback and then use that updated knowledge to select and execute a plan. The system we have built consists of two main parts: Action Selection and Correction Handling.

Action Selection
To generate a valid plan, the agent uses the MetricFF symbolic planner (Hoffmann and Nebel, 2001; Hoffmann, 2003). It requires as input a representation of the current state, the goal, and the action descriptions (here, put(x, y) and unstack(x, y)). The agent knows the position of objects, including which blocks are on each other, and it knows that the goal is to build a tower. However, the agent begins unaware of predicate symbols such as red and blue and ignorant of the rules $r \in G$ that constrain the completed tower.
The aim of our system is to learn to recognise the colours, and so estimate the current state $S^*$, and to identify the correct goal $G$, given the evidence $X$ which it has observed so far. We describe how shortly. The agent uses its current knowledge to construct $S^*$ and $G$, which are given as input to the planner to find a valid plan. Due to errors in the grounding models or goal estimate, this may fail: e.g., if the agent estimates $r_1^{r,b} \in G$ but there are more red blocks than blue blocks in $S^*$, it is impossible to place all of the red blocks. In such cases, the agent recovers by searching in the probabilistic neighbourhood of $S^*$ for alternatives from which a valid plan for achieving $G$ can be constructed (Appelgren and Lascarides, 2020). The agent executes each action in its plan until the plan is completed or the teacher gives corrective feedback. The latter triggers the Correction Handling system (see §5.2).

Grounding Models
The grounding models construct a representation of the current state $S^*$ by predicting the colour of blocks, given their visual features. Binary classifiers represent the probability of an object being a particular colour, e.g., $P(Red_x \mid F_x)$, where $F_x$ are the visual features of object $x$. We use binary classifiers rather than a categorical distribution over colours since the set of possible colours is unknown and colours aren't all mutually exclusive (e.g., maroon and red). We estimate the probability using Bayes' Rule:

$$P(Red_x = 1 \mid F_x) = \frac{P(F_x \mid Red_x = 1)\, P(Red_x = 1)}{\sum_{i \in \{0,1\}} P(F_x \mid Red_x = i)\, P(Red_x = i)} \quad (8)$$

For $P(F_x \mid Red_x = 0)$ we use a uniform distribution: we expect colours that are not red to be distributed over the entire spectrum. $P(F_x \mid Red_x = 1)$ is estimated with weighted Kernel Density Estimation (wKDE). wKDE is a nonparametric model that puts a kernel around every known data point $\{(w_1, F_{x_1}), \ldots, (w_m, F_{x_m})\}$ (where $w_i$ are weights) and calculates the probability of a new data point via a normalised weighted sum of the values of the kernels at that point. With kernel $\varphi$ (we use a diagonal Gaussian kernel), this becomes:

$$P(F_x \mid Red_x = 1) = \frac{1}{\sum_{i=1}^{m} w_i} \sum_{i=1}^{m} w_i\, \varphi(F_x - F_{x_i}) \quad (9)$$

The pairs $(w_i, F_{x_i})$ are generated by the Correction Handling system (see §5.2).

Figure 3: The agent consists of an Action Selection system (yellow) and a Learning System (green). The former uses a symbolic planner to find a plan given the most likely goal and symbol grounding. The latter uses coherence to learn the goal and grounding.
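The grounding model described above (Bayes' Rule with a wKDE likelihood for the positive class and a uniform likelihood for the negative class) can be sketched in a few lines. The sketch makes several assumptions that are not stated in the text: features are RGB values scaled to the unit cube (so the uniform density is 1.0), the kernel uses one shared bandwidth of 0.1 on every dimension, the colour prior is 0.5, and all function names are hypothetical.

```python
import math


def gaussian_kernel(x, centre, bandwidth):
    """Diagonal Gaussian density with the same bandwidth on every dimension."""
    density = 1.0
    for xi, ci in zip(x, centre):
        z = (xi - ci) / bandwidth
        density *= math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))
    return density


def wkde(x, weighted_points, bandwidth=0.1):
    """Normalised weighted sum of kernel values at x: P(F_x | colour = 1)."""
    total_weight = sum(w for w, _ in weighted_points)
    if total_weight == 0:
        return 0.0  # no exemplars yet, so no evidence for the colour
    kernel_sum = sum(w * gaussian_kernel(x, p, bandwidth) for w, p in weighted_points)
    return kernel_sum / total_weight


def p_colour(x, weighted_points, prior=0.5, uniform_density=1.0):
    """Bayes' Rule with a wKDE positive likelihood and a uniform negative likelihood."""
    positive = wkde(x, weighted_points) * prior
    negative = uniform_density * (1.0 - prior)
    return positive / (positive + negative)
```

Given two weighted red exemplars such as `(1.0, (0.9, 0.1, 0.1))` and `(0.5, (0.8, 0.2, 0.1))`, a query near them receives a high probability of being red and a query at the blue end of the cube receives a probability near zero.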

The Goal
In order to estimate $G$ the agent begins with the (correct) knowledge that it must place all blocks in a tower. However, it must use the teacher's feedback $X$ to find the most likely set of additional rules which are also conjuncts in $G$ (see §5.2):

$$G = \arg\max_{r_1, \ldots, r_n} P(r_1 \in G, \ldots, r_n \in G \mid X) \quad (10)$$

$R = \{r_1, \ldots, r_n\}$ is the set of possible rules that the agent is currently aware of, as determined by the colour terms it's aware of (so $R$ gets larger during learning). For each $r \in R$, the agent tracks its probabilistic belief that $r \in G$. Because we believe that any given rule is unlikely to be in the goal, the prior for every $r \in R$ is low: $P(r \in G) = 0.1$. We also assume that, for any two distinct rules $r$ and $r'$, membership in the goal is independent given the evidence:

$$P(r \in G, r' \in G \mid X) = P(r \in G \mid X)\, P(r' \in G \mid X) \quad (11)$$

Due to the independence assumption (11), the goal $G$ is constructed by adding $r \in R$ as a conjunct iff $P(r \in G \mid X) > 0.5$.
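Under the independence assumption, the joint probability in (10) is maximised by deciding each rule separately, which is why a per-rule threshold of 0.5 suffices. A minimal sketch, with a hypothetical function name and rules identified by plain strings for illustration:

```python
def estimate_goal(beliefs, threshold=0.5):
    """Return the rules whose posterior belief exceeds the threshold.

    `beliefs` maps a rule identifier to P(rule in G | X). With independent
    rules, including a rule raises the joint probability exactly when its
    posterior is above 0.5.
    """
    return {rule for rule, p in beliefs.items() if p > threshold}
```

For example, with beliefs of 0.93, 0.12 and 0.5 for three candidate rules, only the first is added to the goal; a belief of exactly 0.5 does not pass the strict threshold.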

Handling Corrections
When the teacher corrects the agent by uttering, for example, u = "no, red blocks should be on blue blocks", the agent must update its knowledge of the world in two ways: it must update its beliefs about which rules are in the goal, as described in §5.1.2, and it must update its models for grounding colour terms. To do so, the agent builds a probabilistic model which supports both inferences. For the goal it uses the inference in equation (10). To learn the colours it infers the posterior probability that the relevant object has the relevant colour:

$$w = P(Red_{o_1} \mid X) \quad (12)$$

It then adds the data point $(w, F(o_1))$ to its grounding model for red objects.
We base our graphical model on the model presented in Appelgren and Lascarides (2020), which we extend to deal with the fact that the teacher's utterance may be faulty. The model is a Bayes Net consisting of a number of different factors which are multiplied together to produce the final output probability. The model from Appelgren and Lascarides (2020) is shown in Figure 4. Grey nodes are observable while white nodes are latent. Arrows show conditional dependence between nodes. If the teacher is faultless then the agent observes that a coherent correction occurred: $CC(a, u)$. The factor for this in the graphical model,

$$P(CC(a, u) \mid r_1^{r,b} \in G, V(r_1^{r,b}, a), r_2^{r,b} \in G, V(r_2^{r,b}, a)) \quad (13)$$

captures equation (3), which stipulates that a coherent correction occurs when a rule which is in the goal is violated. In the graphical model the factor has a value of 1 any time this is true and 0 otherwise. The violation factors $V(r_1^{r,b}, a)$ and $V(r_2^{r,b}, a)$ represent whether or not a particular rule was violated by the action $a$. The agent cannot observe this directly, but must instead infer it from whether or not the objects are red and blue. As such the factor

$$P(V(r_i^{r,b}, a) \mid Red_{o_1}, Blue_{o_2}) \quad (14)$$

captures equation (4) for $i = 1$ and (5) for $i = 2$. The value of the factor is 1 if the relevant equation holds and 0 otherwise. So, for example, when $V(r_1^{r,b}, a) = True$, $Red_{o_1} = True$, and $Blue_{o_2} = False$, the value of the factor is 1.
The remaining nodes $P(Red_{o_1} \mid F_{o_1})$ and $P(Blue_{o_2} \mid F_{o_2})$ are defined by the agent's grounding models. $P(F_{o_i})$ is a prior which is assumed to be a constant for all $o_i$. Finally, $P(r_i^{r,b} \in G)$ is the agent's prior belief that $r_i^{r,b}$ is in the goal ($i = 1, 2$). As we mentioned earlier, this is initially set to 0.1; however, the prior is updated each time the agent encounters a new planning problem instance: it is then set to the agent's current belief given the available evidence.

Figure 4: The nodes added to the graphical model after a correction u = "no, red blocks should be on blue blocks".
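For a single direct correction from a faultless teacher, the posterior defined by these factors can be computed by brute-force enumeration, as in the sketch below. This is an illustration under stated assumptions, not the optimised exact inference of Appelgren and Lascarides (2020): the function names are hypothetical, the violation conditions are our reading of the prose (red on non-blue for the first rule, non-red on blue for the second), and the grounding models are summarised by two input probabilities.

```python
from itertools import product


def bernoulli(p, value):
    """Probability of a binary variable taking `value` when P(True) = p."""
    return p if value else 1.0 - p


def posterior_after_correction(p_g1, p_g2, p_red_o1, p_blue_o2):
    """Marginal posteriors given that a coherent correction was observed."""
    joint = {}
    for g1, g2, red, blue in product([True, False], repeat=4):
        v1 = red and not blue          # assumed direct violation of rule 1
        v2 = (not red) and blue        # assumed direct violation of rule 2
        cc = (g1 and v1) or (g2 and v2)
        if not cc:
            continue                   # deterministic factor: evidence CC = True
        joint[(g1, g2, red, blue)] = (
            bernoulli(p_g1, g1) * bernoulli(p_g2, g2)
            * bernoulli(p_red_o1, red) * bernoulli(p_blue_o2, blue)
        )
    z = sum(joint.values())
    if z == 0:
        raise ValueError("the observed correction has zero probability")

    def marginal(index):
        return sum(p for state, p in joint.items() if state[index]) / z

    return {
        "r1_in_goal": marginal(0),
        "r2_in_goal": marginal(1),
        "red_o1": marginal(2),
        "blue_o2": marginal(3),
    }
```

With rule priors of 0.1 and grounding probabilities of 0.9 that the moved block is red and 0.1 that the block beneath is blue, the correction pushes the belief in the first rule close to 1 while leaving the second rule near its prior. The zero-probability case corresponds to contradictory evidence, where inference fails.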
When the teacher designates a block $o_3$ on the table (thereby signalling that the violation is indirect), the graphical model this generates is similar to Figure 4, save that there are two additional nodes $F_{o_3}$ and $Red_{o_3} \vee Blue_{o_3}$ (see Appelgren and Lascarides (2020) for details).
When the teacher stays silent the agent can infer that no rule which is in the goal was violated. It can therefore build a model similar to Figure 4 which captures the negation of equation (3). The agent can then update its knowledge by making the same inferences as when a correction occurs, but with the observed evidence being that no correction occurred. For further details on how this inference works see Appelgren and Lascarides (2020).

Uncertain Inferences
In this paper we assume that the teacher may make mistakes as described in §4. This introduces a novel problem for the agent, since it can no longer assume that when the teacher says $u$ the utterance coherently attaches to the most recent action $a$. In other words, $CC(a, u)$ becomes latent, rather than observable. What is observable is that the teacher did in fact utter correction $u$ immediately after action $a$. We capture this by adding a new (observable) factor $TeacherCorrection(a, u)$ (or $TC(a, u)$ for short) to the graphical model. When the teacher is infallible, $TC(a, u) \equiv CC(a, u)$, but this no longer holds when the teacher is fallible.
The updated model is shown in Figure 5. $TC(a, u)$ is added as an observable node with $CC(a, u)$ made latent. The factor for $CC(a, u)$ still works in the same way as before, conforming to equation (3). $TC(a, u)$ imposes no semantic constraints on $a$ or on $u$. However, we can use the evidence of $TC(a, u)$ to inform the agent's belief about whether $CC(a, u)$ is true or not, i.e., whether it was actually coherent to utter $u$ in response to $a$. The newly added factor $P(TC(a, u) \mid CC(a, u))$ captures the agent's belief about how faulty the teacher is, and therefore allows the agent to reason about whether $TC(a, u)$ actually means that $CC(a, u)$. In essence, it answers the question "if it is coherent to correct a with u, how likely is it that the teacher actually says u?". So, if the agent believes that the teacher forgets to utter a correction with probability $p = 0.1$ then $P(TC(a, u) = False \mid CC(a, u) = True) = 0.1$. Or if the agent believes that the teacher will falsely correct an action which was actually correct 5% of the time then $P(TC(a, u) = True \mid CC(a, u) = False) = 0.05$. This allows the agent to use the fact that the teacher did (or didn't) utter something to update its beliefs about which rules are in the goal and what colours the objects have.
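The effect of the new factor can be illustrated by extending a brute-force enumeration of the single-correction model with a latent coherence variable and an observed teacher-correction variable. The sketch below is only an illustration under assumptions: the function name and default probabilities are hypothetical, only direct violations are modelled, and the violation conditions follow our reading of the prose.

```python
from itertools import product


def posterior_r1(tc_observed, p_miss, p_add,
                 p_g1=0.1, p_g2=0.1, p_red_o1=0.9, p_blue_o2=0.1):
    """P(rule 1 in G | TC = tc_observed) with CC latent.

    p_miss = P(TC = False | CC = True); p_add = P(TC = True | CC = False).
    """
    def bern(p, value):
        return p if value else 1.0 - p

    numerator = 0.0
    evidence = 0.0
    for g1, g2, red, blue in product([True, False], repeat=4):
        v1 = red and not blue                 # assumed direct violation, rule 1
        v2 = (not red) and blue               # assumed direct violation, rule 2
        cc = (g1 and v1) or (g2 and v2)       # latent: is a correction coherent?
        p_tc_true = (1.0 - p_miss) if cc else p_add
        likelihood = p_tc_true if tc_observed else 1.0 - p_tc_true
        weight = (bern(p_g1, g1) * bern(p_g2, g2)
                  * bern(p_red_o1, red) * bern(p_blue_o2, blue) * likelihood)
        evidence += weight
        if g1:
            numerator += weight
    if evidence == 0:
        raise ValueError("the observation has zero probability under the model")
    return numerator / evidence
```

With `p_miss = p_add = 0` the result matches the faultless model. With the illustrative values from the text (0.1 and 0.05), a correction still raises the belief in the first rule above its prior, but less sharply, because the correction may have been added in error.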
The agent's beliefs about the teacher's fallibility could be estimated from data or could potentially be updated on the fly given the agent's observation of the interaction. However, for the purpose of the experiments in this paper we have direct access to the true probability of teacher fallibility since we explicitly set this probability ourselves. We therefore set the agent's belief about the teacher's fallibility to the true value.
The final change made to the system compared to Appelgren and Lascarides (2020) is to the way inference is done. In their paper they perform exact inference in a manner which was optimised for the structure of the graphical model and the incremental nature of the inference. However, the method relied on the fact that the majority of states had zero probability due to the deterministic factors in the model. When the teacher is fallible the number of zero probability states falls greatly, and exact inference becomes impractical. To deal with this we deploy approximate inference, based on a simple Bayesian Update together with a beam search method which relies on the fact that the model grows incrementally. We first find the probability for every atomic state in the newly added model chunk. This establishes a set of possible nonzero probability atomic states. These are combined with atomic states from the previous inference steps, which we call the beam. The beam is the N most likely states from the previous step. Each new nonzero atomic state is combined with states from the beam if they are compatible, that is, if both states have the same value for any overlapping variables. These new combined atomic states are evaluated on the full model and the N most likely are kept as a new beam, which is normalised to create a consistent probability distribution. Specific probabilities can then be calculated by summing all atomic states that match the chosen value, e.g., where $Red_{o_1} = True$.

Figure 5: The nodes added to the graphical model after a correction u = "no, red blocks should be on blue blocks". Grey is observed and white latent.
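The beam update described above can be sketched as follows. This is a simplified illustration with hypothetical names, not the authors' implementation: atomic states are dictionaries from variable names to values, and `score` stands in for evaluating a combined state on the full model.

```python
def compatible(old_state, new_state):
    """Two states are compatible if they agree on every shared variable."""
    return all(new_state[k] == v for k, v in old_state.items() if k in new_state)


def extend_beam(beam, new_states, score, beam_width):
    """Combine beam states with new nonzero states and keep the N most likely.

    `beam` is a list of (state, probability) pairs; `new_states` is a list of
    states for the newly added model chunk; `score(state)` returns the
    unnormalised probability of a combined state under the full model.
    """
    candidates = []
    for old_state, _ in beam:
        for new_state in new_states:
            if compatible(old_state, new_state):
                merged = {**old_state, **new_state}
                candidates.append((merged, score(merged)))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    kept = candidates[:beam_width]
    total = sum(p for _, p in kept)
    if total == 0:
        raise ValueError("no compatible state with nonzero probability")
    # Normalise so the kept states form a probability distribution.
    return [(state, p / total) for state, p in kept]


def marginal(beam, variable, value):
    """Sum the probability of all beam states where `variable` equals `value`."""
    return sum(p for state, p in beam if state.get(variable) == value)
```

Because low-probability combinations are pruned at every step, the resulting marginals are approximations whose quality depends on the beam width N.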

Experiments
In §4 we mentioned four types of teacher error, and in our experiments we vary the level of the teacher's error in these different types. We believe the most likely is missing indirect errors (MI), since spotting these requires search over all possible future actions. So our first faulty teacher varies $P_{MI} \in \{0.0, 0.1, 0.25, 0.5, 1.0\}$, i.e., from no errors to never correcting any indirect violations at all. Our second teacher makes mistakes with direct violations. We believe missing and adding direct violations will be linked, so we experiment with $P_D = P_{MD} = P_{AD} \in \{0.0, 0.1, 0.25\}$. We study two types of agents in our experiments. The first, the baseline agent BA, ignores the fact that the teacher may be faulty. It simply uses the model described in Appelgren and Lascarides (2020). The only difference is that, since the teacher is actually making mistakes, the agent may sometimes be given contradictory evidence which would cause the inference to fail. In such a situation the agent simply ignores everything that was said in the current scenario and moves on to the next planning problem instance. The second agent is a mistake-aware agent, MA, which makes inferences using the model from §5.3, matching its belief about the teacher's faultiness to the true probability.
In our experiments each agent is given 50 planning problems. Each planning problem has a different goal and a different set of 50 planning problem instances. The agent is reset between each planning problem, but retains knowledge between the 50 problem instances. We measure the number of mistakes the agent makes, which we call regret. A mistake is counted when an action takes a tower from a state where it is possible to complete it in a rule compliant way to one where it isn't without unstacking blocks. In Figures 6 and 7 we present the mean performance over the 50 planning problems, and we use paired t-tests to establish significance.
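The regret curves can be understood as cumulative mistake counts averaged over planning problems. A minimal sketch of that bookkeeping, with hypothetical function names and made-up numbers used purely to show the computation:

```python
from itertools import accumulate


def cumulative_regret(mistakes_per_instance):
    """Running total of mistakes across successive problem instances."""
    return list(accumulate(mistakes_per_instance))


def mean_curve(curves):
    """Pointwise mean of several equally long regret curves."""
    return [sum(values) / len(values) for values in zip(*curves)]
```

For example, mistake counts of 2, 1, 0, 1 over four instances give the curve 2, 3, 3, 4; a flattening slope indicates that the agent has stopped making new mistakes.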
Let's begin by looking at the results for agents learning from teachers that fail to make corrections for indirect violations, shown in Figure 7. Clearly, when the teacher is faulty the agent performs worse (a result shown to be significant by a paired t-test with significance threshold p < 0.01). However, two interesting things can be observed. First, the slopes of the curves are about the same for the agents learning from the faulty teacher and those learning from the faultless teacher. This implies that although the agent takes longer to learn the task when the teacher misses indirect violations, it does seem to reach an equal proficiency by the end. The agent makes more mistakes because it is unaware of several of the mistakes it is making; however, when it is made aware of a mistake it still manages to learn. The second point is that the BA and MA agents are equally good at learning at all levels of teacher error. There is a good reason for this. When the teacher misses indirect violations, the agent can actually trust all other information it receives. If it is given a direct correction then it knows for certain that the teacher gave a coherent correction. This is true for all the feedback the agent receives when the only error the teacher makes is missing indirect violations. For this reason there isn't actually any need to change the way in which the agent learns, which is reassuring given that we believe missed indirect violations to be the more likely error in practice.
Looking at the results when the teacher will both miss and add corrections for direct violations, shown in Figure 6, we see that the agent's performance is much worse, both compared to the faultless agent and to the agents learning from the teachers making direct violations (these results are also significant given a paired t-test with significance threshold p < 0.01). The big difference in this case is that the agent BA performs much worse than MA, especially when the likelihood of failure is higher. This is true both of the final number of mistakes and of the slope of the curve, indicating that the agent is still making more mistakes by the end of training. Figure 8 shows why: it compares the difference between the terminal regret for the faultless teacher vs. the faulty one. For BA there is a much larger spread of outcomes, with a long tail of very high regrets. The results for MA reside in a much narrower region. This implies that, in contrast to MA, BA performs extremely badly in a significant number of cases. The high regret scenarios can be explained by situations where the agent has failed to learn the task successfully and is therefore acting almost randomly. So, making the agent mistake-aware stabilises the learning process, allowing the agent to recover from the teacher's mistakes without completely failing to learn the task, as happens with the baseline.

Conclusion
In this paper we present an ITL model where the agent learns constraints and concepts in a tower building task from a teacher uttering corrections to its actions. The teacher can make mistakes, and to handle this we introduce a separation between the teacher uttering a correction (observable) vs. whether that correction coherently relates to the latest action (latent). Our experiments showed that this separation significantly reduces the proportion of situations where the agent fails to learn; without the separation, learning can go catastrophically wrong when the teacher's mistakes involve direct violations.