Probabilistic Type Theory for Incremental Dialogue Processing

We present an adaptation of recent work on probabilistic Type Theory with Records (Cooper et al., 2014) for the purposes of modelling the incremental semantic processing of dialogue participants. After presenting the formalism and dialogue framework, we show how probabilistic TTR type judgements can be integrated into the inference system of an incremental dialogue system, and discuss how this could be used to guide parsing and dialogue management decisions.


Introduction
While classical type theory has been the predominant mathematical framework in natural language semantics for many years (Montague, 1974, inter alia), it is only recently that probabilistic type theory has been discussed for this purpose. Similarly, type-theoretic representations have been used within dialogue models (Ginzburg, 2012), and probabilistic modelling is common in dialogue systems (Williams and Young, 2007, inter alia), but combinations of the two remain scarce. Here, we attempt to make this connection, taking Cooper et al. (2014)'s probabilistic Type Theory with Records (TTR) as our principal point of departure, with the aim of modelling incremental inference in dialogue.
To our knowledge there has been no practical integration of probabilistic type-theoretic inference into a dialogue system so far; here we discuss computationally efficient methods for implementation in an extant incremental dialogue system. This paper demonstrates their efficacy in simple referential communication domains, but we argue the methods could be extended to larger domains and additionally used for on-line learning in future work.

Previous Work
Type Theory with Records (TTR) (Betarte and Tasistro, 1998; Cooper, 2005) is a rich type theory which has become widely used in dialogue models, including information state models for a variety of phenomena such as clarification requests (Ginzburg, 2012; Cooper, 2012) and non-sentential fragments (Fernández, 2006). It has also been shown to be useful for incremental semantic parsing (Purver et al., 2011), incremental generation (Hough and Purver, 2012), and recently for grammar induction (Eshghi et al., 2013).
While the technical details will be given in section 3, the central judgement in type theory s ∶ T (that a given object s is of type T) is extended in TTR so that s can be a (potentially complex) record and T can be a record type; e.g. s could represent a dialogue gameboard state and T could be a dialogue gameboard state type (Ginzburg, 2012; Cooper, 2012). As TTR is highly flexible with a rich type system, variants have been considered with types corresponding to real-number-valued perceptual judgements used in conjunction with linguistic context, such as visual perceptual information (Larsson, 2011; Dobnik et al., 2012), demonstrating its potential for embodied learning systems. The possibility of integrating perceptron learning (Larsson, 2011) and naive Bayes classifiers (Cooper et al., 2014) into TTR shows how linguistic processing and probabilistic conceptual inference can be treated in a uniform way within the same representation system. Probabilistic TTR as described by Cooper et al. (2014) replaces the categorical s ∶ T judgement with the real-number-valued p(s ∶ T) = v where v ∈ [0,1]. The authors show how standard probability-theoretic and Bayesian equations can be applied to TTR judgements and how an agent might learn from experience in a simple classification game. The agent is presented with instances of a situation and it learns with each round by updating its set of probabilistic type judgements to best predict the type of object in focus: in this case, updating the probability judgement that something is an apple given its observed colour and shape, p(s ∶ T_apple | s ∶ T_Shp, s ∶ T_Col), where Shp ∈ {shp1, shp2} and Col ∈ {col1, col2}. From a cognitive modelling perspective, these judgements can be viewed as probabilistic perceptual information derived from learning.
We use similar methods in our toy domain, but show how prior judgements could be constructed efficiently, and how classifications can be made without exhaustive iteration through individual type classifiers.
There has also been significant experimental work on simple referential communication games in psycholinguistics, computational and formal modelling. In terms of production and generation, Levelt (1989) discusses speaker strategies for generating referring expressions in a simple object naming game. He showed how speakers use informationally redundant features of the objects, violating Grice's Maxim of Quantity. In natural language generation (NLG), referring expression generation (REG) has been widely studied (see Krahmer and Van Deemter (2012) for a comprehensive survey). The incremental algorithm (IA) (Dale and Reiter, 1995) is an iterative feature selection procedure for descriptions of objects based on computing the distractor set of referents that each adjective in a referring expression could cause to be inferred. More recently, Frank and Goodman (2012) present a Bayesian model of optimising referring expressions based on surprisal, the information-theoretic measure of how much descriptions reduce uncertainty about their intended referent, a measure which they claim correlates strongly with human judgements.
The element of the referring expression domain we discuss here is incremental processing. There is evidence from Brennan and Schober (2001)'s experiments that people reason at an incredibly time-critical level from linguistic information. They demonstrated that self-repair can speed up semantic processing (or at least object reference) in such games, where an incorrect object being partly vocalized and then repaired in the instructions (e.g. "the yell-, uh, purple square") yields quicker response times from the onset of the target ("purple") than in the case of the fluent instructions ("the purple square"). This example will be addressed in section 5. First we will set out the framework in which we want to model such processing.

Probabilistic TTR in an incremental dialogue framework
In TTR (Cooper, 2005; Cooper, 2012), the principal logical form of interest is the record type ('RT' from here), consisting of sequences of fields of the form [ l ∶ T ] containing a label l and a type T. RTs can be witnessed (i.e. judged as inhabited) by records of that type, where a record is a set of label-value pairs [ l = v ]. The central type judgement in TTR that a record s is of (record) type R, i.e. s ∶ R, can be made from the component type judgements of individual fields; e.g. the one-field record [ l = v ] is of type [ l ∶ T ] just in case v is of type T. This is generalisable to records and RTs with multiple fields: a record s is of RT R if s includes fields with labels matching those occurring in the fields of R, such that all fields in R are matched, and all matched fields in s have a value belonging to the type of the corresponding field in R. Thus it is possible for s to have more fields than R and for s ∶ R to still hold, but not vice-versa: s ∶ R cannot hold if R has more fields than s. Fields can have values representing predicate types (ptypes), such as T_3 in Figure 1, and consequently fields can be dependent on fields preceding them (i.e. higher) in the RT, e.g. l_1 is bound in the predicate type field l_3, so l_3 depends on l_1.
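The multi-field judgement described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: basic types are modelled as predicates over values, dependent (ptype) fields are not modelled, and the names `of_type`, `is_int` and `is_str` are hypothetical rather than part of any TTR implementation.

```python
from typing import Any, Callable, Dict

# Illustrative encoding (an assumption, not the paper's implementation):
# a basic type is a predicate over values, a record type maps labels to
# basic types, and a record maps labels to values.
BasicType = Callable[[Any], bool]
RecordType = Dict[str, BasicType]
Record = Dict[str, Any]


def of_type(s: Record, R: RecordType) -> bool:
    """Return True iff s : R, i.e. every field of R is matched in s by a
    field with the same label whose value is of the required type."""
    return all(label in s and check(s[label]) for label, check in R.items())


def is_int(v: Any) -> bool:
    return isinstance(v, int)


def is_str(v: Any) -> bool:
    return isinstance(v, str)


R = {"l1": is_int, "l2": is_str}
s_more = {"l1": 7, "l2": "a", "l3": 3.5}  # extra field l3: s : R still holds
s_fewer = {"l1": 7}                        # l2 is missing: s : R fails
s_wrong = {"l1": "seven", "l2": "a"}       # l1 has the wrong type: s : R fails

print(of_type(s_more, R), of_type(s_fewer, R), of_type(s_wrong, R))
# prints: True False False
```

The check iterates over the fields of R only, which is why a record with additional fields still counts as a witness.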

Subtypes, meets and joins
A relation between RTs we wish to explore is ⊑ ('is a subtype of'), which can be defined for RTs in terms of fields as simply: R_1 ⊑ R_2 if for all fields [ l ∶ T_2 ] in R_2, R_1 contains a field [ l ∶ T_1 ] where T_1 ⊑ T_2. In Figure 1, both R_1 ⊑ R_3 and R_2 ⊑ R_3; and R_1 ⊑ R_2 iff T_2 ⊑ T_2′. The transitive nature of this relation (if R_1 ⊑ R_2 and R_2 ⊑ R_3 then R_1 ⊑ R_3) can be used effectively for type-theoretic inference.
We also assume the existence of manifest (singleton) types, e.g. T_a, the type of which only a is a member. Here, we write manifest RT fields such as [ l ∶ T_a ] where T_a ⊑ T using the syntactic sugar [ l=a ∶ T ]. The subtype relation effectively allows progressive instantiation of fields (as addition of fields to R leads to R′ where R′ ⊑ R), which is practically useful for an incremental dialogue system as we will explain. We can also define meet and join types of two or more RTs. The representation of the meet type of two RTs R_1 and R_2 is the result of a merge operation (Larsson, 2010), which in simple terms here can be seen as union of fields. A meet type is also equivalent to the extraction of a maximal common subtype, an operation we will call MaxSub(R_i..R_n); R_1 and R_2 are common supertypes of the resulting R_1 ∧ R_2. On the other hand, the join of two RTs R_1 and R_2, the type R_1 ∨ R_2, cannot be represented by field intersection. It is defined in terms of type checking, in that s ∶ R_1 ∨ R_2 iff s ∶ R_1 or s ∶ R_2. While technically the maximally common supertype of R_1 and R_2 is the join type R_1 ∨ R_2, here we introduce the maximally common simple (non-disjunctive) supertype of two RTs R_1 and R_2 as field intersection, an operation we will call MaxSuper(R_1, R_2). We will explore the usefulness of this new operation in terms of RT lattices in sec. 4.
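As a rough illustration of these three operations, the sketch below treats a simple RT as a mapping from labels to type names. Several things here are assumptions made for brevity: basic types are compared by name only (so subtyping between basic types is reduced to equality), dependent fields are ignored, and a label clash in the meet is reported as `None` to stand in for ⊥. The function names mirror the paper's MaxSub and MaxSuper but the code is not taken from DyLan.

```python
from typing import Dict, Optional

RT = Dict[str, str]  # label -> type name (assumption: no dependent fields)


def is_subtype(r1: RT, r2: RT) -> bool:
    """r1 ⊑ r2: every field of r2 also occurs in r1 with the same type."""
    return all(r1.get(label) == typ for label, typ in r2.items())


def max_sub(r1: RT, r2: RT) -> Optional[RT]:
    """Meet as union of fields; None stands in for ⊥ when a label clashes."""
    for label in r1.keys() & r2.keys():
        if r1[label] != r2[label]:
            return None
    return {**r1, **r2}


def max_super(r1: RT, r2: RT) -> RT:
    """Maximally common simple supertype as field intersection."""
    return {label: typ for label, typ in r1.items() if r2.get(label) == typ}


purple = {"x": "Ind", "c": "purple(x)"}
square = {"x": "Ind", "s": "square(x)"}

meet = max_sub(purple, square)      # fields x, c and s
common = max_super(purple, square)  # only the shared field x
```

In this encoding both inputs are supertypes of their meet and subtypes of their field intersection, which is the ordering the lattice in sec. 4 relies on.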

Probabilistic TTR
We follow Cooper et al. (2014)'s recent extension of TTR to include probabilistic type judgements of the form p(s ∶ R) = v where v ∈ [0,1], i.e. the real-valued judgement that a record s is of RT R. Here we use probabilistic TTR to model a common psycholinguistic experimental set-up in section 5. We repeat some of Cooper et al.'s calculations here for exposition, but demonstrate efficient graphical methods for generating and incrementally retrieving probabilities in section 4. Cooper et al. (2014) define the probability of the meet and join types of two RTs as follows:
\[ p(s : R_1 \wedge R_2) = p(s : R_1)\, p(s : R_2 \mid s : R_1) \qquad p(s : R_1 \vee R_2) = p(s : R_1) + p(s : R_2) - p(s : R_1 \wedge R_2) \tag{1} \]
It is practically useful, as we will describe below, that the join probability can be computed in terms of the meet. Also, there are equivalences between meets, joins and subtypes in terms of type judgements as described above, in that, assuming R_1 ⊑ R_2:
\[ p(s : R_1 \wedge R_2) = p(s : R_1) \qquad p(s : R_1 \vee R_2) = p(s : R_2) \qquad p(s : R_1) \le p(s : R_2) \tag{2} \]
The conditional probability of a record being of type R_2 given it is of type R_1 is:
\[ p(s : R_2 \mid s : R_1) = \frac{p(s : R_1 \wedge R_2)}{p(s : R_1)} \]
We return to an explanation for these classical probability equations holding within probabilistic TTR in section 4.
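As a small worked example with made-up values (they are illustrative assumptions, not figures from Cooper et al.), suppose p(s ∶ R_1) = 0.5, p(s ∶ R_2) = 0.4 and p(s ∶ R_2 | s ∶ R_1) = 0.6. The product rule gives the meet, the sum rule gives the join from the meet, and dividing the meet by p(s ∶ R_1) recovers the conditional:

```latex
\begin{align*}
p(s : R_1 \wedge R_2) &= p(s : R_1)\, p(s : R_2 \mid s : R_1) = 0.5 \times 0.6 = 0.3 \\
p(s : R_1 \vee R_2)   &= p(s : R_1) + p(s : R_2) - p(s : R_1 \wedge R_2) = 0.5 + 0.4 - 0.3 = 0.6 \\
p(s : R_2 \mid s : R_1) &= \frac{p(s : R_1 \wedge R_2)}{p(s : R_1)} = \frac{0.3}{0.5} = 0.6
\end{align*}
```

Only the meet probability and the two marginals are needed for the join, which is why storing meets is sufficient for the disjunctive case as well.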

Learning and storing probabilistic judgements
When dealing with referring expression games, or indeed any language game, we need a way of storing perceptual experience. In probabilistic TTR this can be achieved by positing a judgement set J in which an agent stores probabilistic type judgements. We refer to the sum of the value of probabilistic judgements that a situation has been judged to be of type R_i within J as ∥R_i∥_J and the sum of all probabilistic judgements in J simply as P(J); thus the prior probability that anything is of type R_i under the set of judgements J is
\[ p_J(s : R_i) = \frac{\lVert R_i \rVert_J}{P(J)} \]
The conditional probability p(s ∶ R_1 | s ∶ R_2) under J can be reformulated in terms of these sets of judgements:
\[ p_J(s : R_1 \mid s : R_2) = \frac{\lVert R_1 \wedge R_2 \rVert_J}{\lVert R_2 \rVert_J} \tag{4} \]
where the sample spaces ∥R_1 ∧ R_2∥_J and ∥R_2∥_J constitute the observations of the agent so far. J can have new judgements added to it during learning. We return to this after introducing the incremental semantics needed to interface therewith.
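The prior and conditional calculations over a judgement set can be sketched as follows. The class `JudgementSet` and its method names are hypothetical, simple RTs are encoded as frozensets of field names (so a subtype has a superset of fields and the meet is the union of fields), and a zero denominator is mapped to probability 0 as a convenience; none of this is prescribed by the paper.

```python
from typing import FrozenSet, Iterable, List, Tuple

RT = FrozenSet[str]  # a simple RT as a set of field names (assumption)


class JudgementSet:
    """A judgement set J holding (record type, value) pairs."""

    def __init__(self, judgements: Iterable[Tuple[RT, float]]) -> None:
        self.judgements: List[Tuple[RT, float]] = list(judgements)

    def mass(self, r: RT) -> float:
        """||r||_J: summed value of judgements whose type is a subtype of r."""
        return sum(v for t, v in self.judgements if r <= t)

    def total(self) -> float:
        """P(J): the sum of all judgements in J."""
        return sum(v for _, v in self.judgements)

    def prior(self, r: RT) -> float:
        """p_J(s : r) = ||r||_J / P(J)."""
        return self.mass(r) / self.total()

    def conditional(self, r1: RT, r2: RT) -> float:
        """p_J(s : r1 | s : r2) = ||r1 ∧ r2||_J / ||r2||_J."""
        denominator = self.mass(r2)
        return self.mass(r1 | r2) / denominator if denominator else 0.0


# One observation each of a purple square, a yellow square and a yellow circle.
J = JudgementSet([
    (frozenset({"purple", "square"}), 1.0),
    (frozenset({"yellow", "square"}), 1.0),
    (frozenset({"yellow", "circle"}), 1.0),
])
yellow = frozenset({"yellow"})
yellow_square = frozenset({"yellow", "square"})
```

With these three observations, two of the three judgements are subtypes of the yellow RT, and one of those two is a yellow square.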

DS-TTR and the DyLan dialogue system
In order to permit type-theoretic inference in a dialogue system, we need to provide suitable TTR representations for utterances and the current pragmatic situation from a parser, dialogue manager and generator as instantaneously and accurately as possible. For this purpose we use an incremental framework DS-TTR (Eshghi et al., 2013; Purver et al., 2011) which integrates TTR representations with the inherently incremental grammar formalism Dynamic Syntax (DS) (Kempson et al., 2001). Due to DS-TTR's monotonicity, each maximal RT of the tree's root node is a subtype of the parser's previous maximal output.
Following Eshghi et al. (2013), DS-TTR tree nodes include a field head in all RTs which corresponds to the DS tree node type. We also assume a neo-Davidsonian representation of predicates, with fields corresponding to an event term and to each semantic role; this allows all available semantic information to be specified incrementally in a strict subtyping relation (e.g. providing the subj() field when subject but not object has been parsed); see Figure 2.
We implement DS-TTR parsing and generation mechanisms in the DyLan dialogue system within Jindigo (Skantze and Hjalmarsson, 2010), a Java-based implementation of the incremental unit (IU) framework of Schlangen and Skantze (2009). In this framework, each module has input and output IUs which can be added as edges between vertices in module buffer graphs, and become committed should the appropriate conditions be fulfilled, a notion which becomes important in light of hypothesis change and repair situations. Dependency relations between different graphs within and between modules can be specified by groundedIn links (see Schlangen and Skantze (2009) for details).
The DyLan interpreter module (Purver et al., 2011) uses Sato (2011)'s insight that the context of DS parsing can be characterized in terms of a Directed Acyclic Graph (DAG) with trees for nodes and DS actions for edges. The module's state is characterized by three linked graphs as shown in Figure 3:
• input: a time-linear word graph posted by the ASR module, consisting of word hypothesis edge IUs between vertices W_n
• processing: the internal DS parsing DAG, which adds parse state edge IUs between vertices S_n groundedIn the corresponding word hypothesis edge IU
• output: a concept graph consisting of domain concept IUs (RTs) as edges between vertices C_n, groundedIn the corresponding path in the DS parsing DAG
Here, our interest is principally in the parser output, to support incremental inference; a DS-TTR generator is also included which uses RTs as goal concepts (Hough and Purver, 2012) and uses the same parse graph as the interpreter to allow self-monitoring and compound contributions, but we omit the details here.

RT lattices to encode domain knowledge
To support efficient inference in DyLan, we represent dialogue domain concepts via partially ordered sets (posets) of RT judgements, following similar insights used in inducing DS-TTR actions (Eshghi et al., 2013). A poset has several advantages over an unordered list of un-decomposed record types: the possibility of incremental type-checking; increased speed of type-checking, particularly for pairs of/multiple type judgements; immediate use of type judgements to guide system decisions; inference from negation; and the inclusion of learning within a domain. We leave the final challenge for future work, but discuss the others here. We can construct a poset of type judgements for any single RT by decomposing it into its constituent supertype judgements in a record type lattice. Representationally, as per set-theoretic lattices, this can be visualised as a Hasse diagram such as Figure 4; here, however, the ordering arrows show ⊑ ('subtype of') relations from descendant to ancestor nodes.
To characterize an RT lattice G ordered by ⊑, we adapt Knuth (2005)'s description of lattices in line with standard order theory: for a pair of RT elements R_x and R_y, their lower bound is the set of all R_z ∈ G such that R_z ⊑ R_x and R_z ⊑ R_y. In the event that a unique greatest lower bound exists, this is their meet, which in G happily corresponds to the TTR meet type R_x ∧ R_y. Dually, if their unique least upper bound exists, this is their join and in TTR terms is MaxSuper(R_x, R_y), but not necessarily their join type R_x ∨ R_y, as here we concern ourselves with simple RTs. One element covers another if it is a direct successor to it in the subtype ordering relation hierarchy. G has a greatest element (⊤) and least element (⊥), with the atoms being the elements that cover ⊥; in Figure 4, if R_1 is viewed as ⊥, the atoms are the RTs that cover R_1. G is complemented if every element R_x has a complement in G whose join with R_x is ⊤ and whose meet with R_x is ⊥ (Figure 4 is complemented as this holds for every element). Graphically, the join of two elements can be found by following the connecting edges upward until they first converge on a single RT, giving us MaxSuper(R_10, R_12) = R_121 in Figure 4, and the meet can be found by following the lines downward until they connect to give their meet type, i.e. R_10 ∧ R_12 = R_1.
Figure 4: Record Type lattice ordered by the subtype relation
If we consider R_1 to be a domain concept in a dialogue system, we can see how its RT lattice G can be used for incremental inference. As incrementally specified RTs become available from the interpreter they are matched to those in G to determine how far down towards the final domain concept R_1 our current state allows us to be. Different sequences of words/utterances lead to different paths. However, any practical dialogue system must entertain more than one possible domain concept as an outcome; G must therefore contain multiple possible final concepts, constituting its atoms, each with several possible dialogue move sequences, which correspond to possible downward paths (e.g. see the structure of Figure 5). Our aim here is to associate each RT in G with a probabilistic judgement.

Initial lattice construction
We define a simple bottom-up procedure in Algorithm 1 to build an RT lattice G of all possible simple domain RTs and their prior probabilistic judgements, initialised by the disjunction of possible final state judgements (the priors), along with the absurdity ⊥, stipulated a priori as the least element with probability 0 and the meet type of the atomic priors. The algorithm recursively removes one field from the RT being processed at a time (except fields referenced in a remaining dependent ptype field), then orders the new supertype RT in G appropriately. Each node in G contains its RT R_i and a sum of probability judgements {∥R_k∥_J + .. + ∥R_n∥_J} corresponding to the probabilities of the priors it stands in a supertype relation to. These sums are propagated up from child to parent node as the lattice is constructed. The algorithm terminates when all simple maximal supertypes have been processed, leaving the maximally common supertype as ⊤ (possibly the empty type [ ]), associated with the entire probability mass P(J), which constitutes the denominator to all judgements; given this, only the numerator ∥R_i∥_J of the prior probability ∥R_i∥_J / P(J) needs to be stored at each node.

Algorithm 1
INPUT: priors    ▷ use the initial prior judgements for G's atoms
OUTPUT: G
G = newGraph(priors)    ▷ P(J) set to equal sum of prior probs
agenda = priors    ▷ Initialise agenda
while not agenda is empty do
    RT = agenda.pop()
    for field ∈ RT do
        if field ∈ RT.paths then    ▷ Do not remove bound fields
            continue
        superRT = RT - field
        if superRT ∈ G then    ▷ not new? order w.r.t. RT and inherit RT's priors
            G.order(RT.address, G.getNode(superRT), ⊑)
        else    ▷ new?
            superNode = G.newNode(superRT)    ▷ create new node w. empty priors
            for node ∈ G do    ▷ order superNode w.r.t. other nodes in G
                if superRT.fields ⊂ node.fields then
                    G.order(node, superNode, ⊑)    ▷ superNode inherits node's priors
            agenda.append(superRT)    ▷ add to agenda for further supertyping

Direct inference from the lattice
To explain how our approach models incremental inference, we assume Brennan and Schober (2001)'s experimental referring game domain described in section 2: three distinct domain situation RTs R_1, R_2 and R_3 correspond to a purple square, a yellow square and a yellow circle, respectively. The RT lattice G constructed initially upon observation of the game (by instructor or instructee), shown in Figure 5, uses a uniform distribution for the three disjunctive final situations. Each node shows an RT R_i on the left and the derivation of its prior probability p_J(R_i) that any game situation record will be of type R_i on the right, purely in terms of the relevant priors and the global denominator P(J).
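For this three-object domain, the bottom-up construction of Algorithm 1 might be sketched in Python as below. This is a simplified reading rather than the DyLan code: RTs are frozensets of field names, the hypothetical `deps` map says which fields a ptype field depends on, and each node's numerator is computed directly from the priors it is a supertype of instead of being propagated along ordering edges, which yields the same sums. The ⊥ node is omitted.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

RT = FrozenSet[str]  # a simple RT as a set of field names (assumption)


def build_lattice(priors: Dict[RT, float],
                  deps: Dict[str, Set[str]]) -> Tuple[Dict[RT, float], Set[Tuple[RT, RT]]]:
    """Return (nodes, edges): nodes maps each RT to its summed prior mass,
    edges holds (subtype, supertype) pairs made by removing one field."""

    def mass(rt: RT) -> float:
        # Sum of the priors that rt stands in a supertype relation to.
        return sum(v for prior, v in priors.items() if rt <= prior)

    nodes: Dict[RT, float] = {rt: mass(rt) for rt in priors}
    edges: Set[Tuple[RT, RT]] = set()
    agenda: List[RT] = list(priors)
    while agenda:
        rt = agenda.pop()
        for field in rt:
            # Do not remove a field that a remaining ptype field depends on.
            if any(field in deps.get(other, set()) for other in rt - {field}):
                continue
            super_rt = rt - {field}
            edges.add((rt, super_rt))
            if super_rt not in nodes:  # new supertype: store it and recurse
                nodes[super_rt] = mass(super_rt)
                agenda.append(super_rt)
    return nodes, edges


# Unnormalised uniform priors; P(J) is the mass stored at the top node.
PRIORS = {
    frozenset({"x", "purple(x)", "square(x)"}): 1.0,
    frozenset({"x", "yellow(x)", "square(x)"}): 1.0,
    frozenset({"x", "yellow(x)", "circle(x)"}): 1.0,
}
DEPS = {p: {"x"} for p in ("purple(x)", "yellow(x)", "square(x)", "circle(x)")}
nodes, edges = build_lattice(PRIORS, DEPS)
```

Because the field x is bound by every ptype field, it is only removed once it stands alone, so the top of this particular lattice is the empty type with mass P(J) = 3.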
G can be searched to make inferences in light of partial information from an ongoing utterance. We model inference as predicting the likelihood of relevant type judgements R_y ∈ G of a situation s, given the judgement s ∶ R_x we have so far. To do this we use conditional probability judgements following Knuth's work on distributive lattices, using the ⊑ relation to give a choice function:
\[ p_J(s : R_y \mid s : R_x) = \begin{cases} 1 & \text{if } R_x \sqsubseteq R_y \\ 0 & \text{if } R_x \wedge R_y = \bot \\ p & \text{otherwise, where } 0 \le p \le 1 \end{cases} \tag{5} \]
The third case is the degree of inclusion of R_y in R_x, and can be calculated using the conditional probability calculation (4) in sec. 3. For negative RTs, a lattice generated from Algorithm 1 will be distributive but not guaranteed to be complemented. However, we can still derive p_J(s ∶ R_y | s ∶ ¬R_x) by obtaining p_J(s ∶ R_y) in G modulo the probability mass of R_x and that of its subtypes:
\[ p_J(s : R_y \mid s : \neg R_x) = \begin{cases} 0 & \text{if } R_y \sqsubseteq R_x \\ \dfrac{p_J(s : R_y) - p_J(s : R_x \wedge R_y)}{p_J(s : \top) - p_J(s : R_x)} & \text{otherwise} \end{cases} \tag{6} \]
The subtype relations and atomic, join and meet types' probabilities required for (1)-(6) can be calculated efficiently through graphical search algorithms by characterising G as a DAG: the reverse direction of the subtype ordering edges can be viewed as reachability edges, making ⊤ the source and ⊥ the sink. With this characterisation, if R_x is reachable from R_y then R_x ⊑ R_y.
In DAG terms, the probability of the meet of two RTs R_x and R_y can be found at their highest common descendant node; e.g. p_J(R_4 ∧ R_5) in Figure 5 can be found as 1/3 directly at R_1. Note that if R_x is reachable from R_y, i.e. R_x ⊑ R_y, then due to the equivalences listed in (2), p_J(R_x ∧ R_y) can be found directly at R_x. If the meet of two nodes is ⊥ (e.g. R_4 and R_3 in Figure 5), then their meet probability is 0 as p_J(⊥) = 0.
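The highest-common-descendant lookup can be illustrated with a small hand-built DAG for the three-object game. The node names, the `CHILDREN` and `MASS` tables and the helper functions are all hypothetical, and the search below is a naive set-based version chosen for readability rather than the linear-time search discussed in the text.

```python
from typing import Dict, List, Set

# Edges point downward, from a supertype to the subtypes it covers.
CHILDREN: Dict[str, List[str]] = {
    "top": ["purple", "yellow", "square", "circle"],
    "purple": ["purple_square"],
    "yellow": ["yellow_square", "yellow_circle"],
    "square": ["purple_square", "yellow_square"],
    "circle": ["yellow_circle"],
    "purple_square": ["bottom"],
    "yellow_square": ["bottom"],
    "yellow_circle": ["bottom"],
    "bottom": [],
}
# Numerators stored at each node; P(J) is the value at "top".
MASS: Dict[str, float] = {
    "top": 3.0, "purple": 1.0, "yellow": 2.0, "square": 2.0, "circle": 1.0,
    "purple_square": 1.0, "yellow_square": 1.0, "yellow_circle": 1.0,
    "bottom": 0.0,
}


def descendants(node: str) -> Set[str]:
    """All nodes reachable by following edges downward, node included."""
    seen, stack = {node}, [node]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def meet(x: str, y: str) -> str:
    """Highest common descendant: the common node that reaches all others."""
    common = descendants(x) & descendants(y)
    return next(n for n in common if common <= descendants(n))


def meet_prob(x: str, y: str) -> float:
    return MASS[meet(x, y)] / MASS["top"]


def join_prob(x: str, y: str) -> float:
    """Join probability computed from the meet, as in the sum rule."""
    return (MASS[x] + MASS[y]) / MASS["top"] - meet_prob(x, y)
```

For instance, `meet("purple", "square")` lands on the purple-square node, while `meet("purple", "yellow_circle")` falls through to the bottom node, whose stored mass of 0 gives a meet probability of 0.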
While the lattice does not have direct access to the join types of its elements, a join type probability p_J(R_x ∨ R_y) can be calculated in terms of p_J(R_x ∧ R_y) by the join equation in (1), which holds for all probabilistic distributive lattices (Knuth, 2005).⁸ As regards efficiency, worst case complexity for finding the meet probability at the common descendant of R_x and R_y is a linear O(m + n) where m and n are the number of edges in the downward (possibly forked) paths R_x → ⊥ and R_y → ⊥.⁹
Figure 5: Record type lattice with initial uniform prior probabilities
⁸ The search for the meet probability is generalisable to conjunctive types by searching for the conjuncts' highest common descendant. The join probability is generalisable to the disjunctive probability of multiple types, used, albeit programmatically, in Algorithm 1 for calculating a node's probability from its child nodes.
⁹ While we do not give details here, simple graphical search algorithms for conjunctive and disjunctive multiple types are linear in the number of conjuncts and disjuncts, saving considerable time in comparison to the algebraic calculations of the sum and product rules for distributive lattices.

Simulating incremental inference and self-repair processing
Interpretation in DyLan and its interface to the RT lattice G follows evidence that dialogue agents parse self-repairs efficiently and that repaired dialogue content (reparanda) is given special status but not removed from the discourse context. To model Brennan and Schober (2001)'s findings of disfluent spoken instructions speeding up object recognition (see section 2), we demonstrate a self-repair parse in Figure 6 for "The yell-, uh, purple square" in the simple game of predicting the final situation from {R_1, R_2, R_3} continuously given the type judgements made so far. We describe the stages T1-T4 in terms of the current word being processed (see Figure 6). At T1: 'the', the interpreter will not yield a subtype checkable in G so we can only condition on R_8 (⊤), giving us p_J(s ∶ R_i | s ∶ R_8) = 1/3 for i ∈ {1, 2, 3}, equivalent to the priors. At T2: 'yell-', the best partial word hypothesis is now "yellow"; the interpreter therefore outputs an RT which matches the type judgement s ∶ R_6 (i.e. that the object is a yellow object). Taking this judgement as the conditioning evidence, using function (5) we get p_J(s ∶ R_1 | s ∶ R_6) = 0 and using (4) we get p_J(s ∶ R_2 | s ∶ R_6) = 0.5 and p_J(s ∶ R_3 | s ∶ R_6) = 0.5 (see the schematic probability distribution at stage T2 in Figure 6 for the three objects). The meet type probabilities required for the conditional probabilities can be found graphically as described above. At T3: 'uh purple', low probability in the interpreter output causes a self-repair to be recognised, enforcing backtracking on the parse graph which informally operates as follows (see Hough and Purver (2012)):

Self-repair:
IF from parsing word W the edge SE_n is insufficiently likely to be constructed from vertex S_n
OR IF there is no sufficiently likely judgement p(s ∶ R_x) for R_x ∈ G
THEN parse word W from vertex S_{n−1}.
    IF successful add a new edge to the top path, without removing any committed edges beginning at S_{n−1};
    ELSE set n = n−1 and repeat.
This algorithm is consistent with a local model for self-repair backtracking found in corpora (Shriberg and Stolcke, 1998). As regards inference in G, upon detection of a self-repair that revokes s ∶ R_6, the type judgement s ∶ ¬R_6, i.e. that this is not a yellow object, is immediately available as conditioning evidence. Using (6), our distribution of RT judgements now shifts: p_J(s ∶ R_1 | s ∶ ¬R_6) = 1, p_J(s ∶ R_2 | s ∶ ¬R_6) = 0 and p_J(s ∶ R_3 | s ∶ ¬R_6) = 0 before "purple" has been parsed, thus providing a probabilistic explanation for increased subsequent processing speed. Finally, at T4: 'square', given p_J(s ∶ R_1 | s ∶ R_1) = 1 and R_1 ∧ R_2 = R_1 ∧ R_3 = ⊥, the distribution remains unchanged.
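The four stages can be replayed numerically with a short sketch. It assumes the same simplified encoding as before (simple RTs as frozensets of field names, unnormalised uniform priors) and implements the positive and negative conditionals as described in the text; the function names `cond` and `cond_neg` are hypothetical and the parser itself is not modelled, only the evidence it would deliver at each stage.

```python
from typing import Dict, FrozenSet, List

RT = FrozenSet[str]

R1: RT = frozenset({"purple", "square"})  # purple square
R2: RT = frozenset({"yellow", "square"})  # yellow square
R3: RT = frozenset({"yellow", "circle"})  # yellow circle
R6: RT = frozenset({"yellow"})            # any yellow object
TOP: RT = frozenset()                     # no information yet
PRIORS: Dict[RT, float] = {R1: 1.0, R2: 1.0, R3: 1.0}


def mass(rt: RT) -> float:
    """Summed prior mass of the final situations that are subtypes of rt."""
    return sum(v for prior, v in PRIORS.items() if rt <= prior)


def cond(ry: RT, rx: RT) -> float:
    """p_J(s : ry | s : rx) via the three-way choice function."""
    if ry <= rx:                  # rx is a subtype of ry
        return 1.0
    meet = mass(rx | ry)          # meet as union of fields
    return meet / mass(rx) if meet else 0.0


def cond_neg(ry: RT, rx: RT) -> float:
    """p_J(s : ry | s : not rx): ry's mass outside rx, renormalised."""
    if rx <= ry:                  # ry is a subtype of rx
        return 0.0
    remaining = mass(TOP) - mass(rx)
    return (mass(ry) - mass(rx | ry)) / remaining if remaining else 0.0


targets = (R1, R2, R3)
stages: Dict[str, List[float]] = {
    "T1 the": [cond(r, TOP) for r in targets],
    "T2 yell-": [cond(r, R6) for r in targets],
    "T3 uh purple": [cond_neg(r, R6) for r in targets],
    "T4 square": [cond(r, R1) for r in targets],
}
for stage, distribution in stages.items():
    print(stage, [round(p, 3) for p in distribution])
```

The distribution already concentrates on the purple square at T3, before any positive evidence for "purple" is used, which is the effect the revocation-based account is meant to capture.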
The system's processing models how listeners reason about the revocation itself rather than predicting the outcome through positive evidence alone, in line with Brennan and Schober (2001)'s results.

Extensions
Dialogue and self-repair in the wild
Moving towards domain-generality is difficult, as generating the lattice of all possible dialogue situations for interesting domains is computationally intractable. We intend instead to consider incrementally occurring issues that can be modelled as questions (Larsson, 2002). Given one or more issues manifest in the dialogue at any time, it is plausible to generate small lattices dynamically to estimate possible answers, and also to assign a real-valued relevance measure to questions that can be asked to resolve the issues. We are exploring how this could be implemented using the inquiry calculus (Knuth, 2005), which defines information-theoretic relevance in terms of a probabilistic question lattice, and furthermore how this could be used to model the cause of self-repair as a time-critical trade-off between relevance and accuracy.

Learning in a dialogue
While not our focus here, lattice G's probabilities can be updated through observations after its initial construction. If a reference game is played over several rounds, the choice of referring expression can change based on mutually salient functions from words to situations; see e.g. DeVault and Stone (2009). Our currently frequentist approach to learning is as follows: when an observation of an existing RT R_i is made with probability v, then ∥R_i∥_J, the overall denominator P(J), and the nodes in the upward path from R_i to ⊤ are incremented by v. The approach could be converted to Bayesian update learning by using the prior probabilities in G for calculating v before it is added. Furthermore, observations can be added to G that include novel RTs: due to the DAG structure of G, their subtype ordering and probability effects can be integrated efficiently.
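The frequentist update can be sketched as an upward traversal from the observed node. The `observe` function, the `PARENTS` table and the node names are hypothetical; the sketch assumes that P(J) is the numerator stored at the top node, and it visits each ancestor once even when several upward paths lead to it.

```python
from typing import Dict, List, Set

# Upward (subtype -> supertype) edges for a fragment of the game lattice.
PARENTS: Dict[str, List[str]] = {
    "yellow_square": ["yellow", "square"],
    "yellow": ["top"],
    "square": ["top"],
    "top": [],
}
# Numerators stored at each node; the value at "top" plays the role of P(J).
mass: Dict[str, float] = {
    "yellow_square": 1.0, "yellow": 2.0, "square": 2.0, "top": 3.0,
}


def observe(mass: Dict[str, float], parents: Dict[str, List[str]],
            node: str, v: float) -> None:
    """Add v to node and to every node on an upward path to the top, once each."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        mass[current] += v
        stack.extend(parents.get(current, []))


# Observe a yellow square with probability 0.5.
observe(mass, PARENTS, "yellow_square", 0.5)
prior_yellow_square = mass["yellow_square"] / mass["top"]
```

After this single observation the yellow-square node holds 1.5 out of a total of 3.5, so its prior rises from 1/3 to 3/7.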

Conclusion
We have discussed efficient methods for constructing probabilistic TTR domain concept lattices ordered by the subtype relation and their use in incremental dialogue frameworks, demonstrating their efficacy for realistic self-repair processing. We wish to explore inclusion of join types, the scalability of RT lattices to other domains and their learning capacity in future work.