Reference-Centric Models for Grounded Collaborative Dialogue

We present a grounded neural dialogue model that successfully collaborates with people in a partially-observable reference game. We focus on a setting where two agents each observe an overlapping part of a world context and need to identify and agree on some object they share. Therefore, the agents should pool their information and communicate pragmatically to solve the task. Our dialogue agent accurately grounds referents from the partner’s utterances using a structured reference resolver, conditions on these referents using a recurrent memory, and uses a pragmatic generation procedure to ensure the partner can resolve the references the agent produces. We evaluate on the OneCommon spatial grounding dialogue task (Udagawa and Aizawa 2019), involving a number of dots arranged on a board with continuously varying positions, sizes, and shades. Our agent substantially outperforms the previous state of the art for the task, obtaining a 20% relative improvement in successful task completion in self-play evaluations and a 50% relative improvement in success in human evaluations.


Introduction
In grounded dialogue settings involving high degrees of ambiguity, correctly interpreting and informatively generating language can prove challenging. Consider the collaborative dialogue game shown in Figure 1. Each player has a separate, but overlapping, view of an underlying context. They need to communicate to determine and agree on one dot that they share, and both players win if they choose the same dot. To succeed, each participant must, implicitly or explicitly, ground their partner's descriptions in their own context, maintain a history of what has been described and what their partner is likely to have, and informatively convey parts of their own context.
We present a grounded pragmatic dialogue system which collaborates successfully with people on the task above. Figure 1 shows a real example game between our system and a human partner. Our approach is centered around a structured module for perceptually-grounded reference resolution. This reference resolution module plays two roles. First, the module is used to interpret the partner's utterances: explicitly predicting which referents (if any) in the agent's context the partner is referring to, for example a smaller darker grey dot and the lightest grey and larger dot. Second, the reference module is used for pragmatic generation: choosing utterances by reasoning about how the partner might interpret them in context. Our pragmatic generation procedure selects referents to describe as well as choosing how to describe them, for example focusing on the light one (Figure 1).

Figure 1: An example dialogue produced by our system (A) with a human partner (P). The participants have different but overlapping views of a shared board, which contains dots of different shades and sizes. The partners must collaborate through dialogue in order to find and select a dot that is visible to both. (Panels: Underlying Hidden State, Agent's View, Partner's View.)
A: I have two large light grey dots with a smaller darker dot below and to the right of it
P: I have one large lightest grey dot in my entire view
A: I have two large light grey dots
P: There is a smaller slightly darker grey dot next to the lightest grey and larger dot
A: Yes, let's pick the light one
P: The light grey and large dot?
A: Yes it's the one. Let's pick that one
P: SELECT red
A: SELECT blue
Much past work that has constructed systems for grounded collaborative dialogue has focused on settings that have asymmetric player roles (Kim et al., 2019; de Vries et al., 2018; Das et al., 2017), are fully observable, or are grounded in symbolic attributes (He et al., 2017). In contrast, we focus on the ONECOMMON corpus and task (Udagawa and Aizawa, 2019), which is symmetric, partially observable, and has relatively complex spatial and perceptual grounding. These traits necessitate complex dialogue strategies such as common grounding, coordination, clarification questions, and nuanced acknowledgment (Udagawa and Aizawa, 2019), making the task challenging even for pairs of human partners.
Past work on ONECOMMON has focused on the subtask of reference resolution (Udagawa and Aizawa, 2020; Udagawa et al., 2020) and only evaluated dialogue systems automatically: using static evaluation on human-human games and self-play evaluations that simulate human partners using another copy of the agent. Our system outperforms this past work on these evaluations. We further confirm these results by performing, for the first time on this task, human evaluations, where we find that our system obtains a 50% relative increase in success rate over a system from past work when paired with human partners. We release code for our system at https://github.com/dpfried/onecommon.

Setting
We choose to focus on the ONECOMMON task (Udagawa and Aizawa, 2019) since it is a particularly challenging representative of a class of partially-observable collaborative reference dialogue games (e.g., He et al. 2017;Haber et al. 2019). In this task, two players have different but overlapping views of a game board, which consists of dots of various positions, shades of gray and sizes. The players must coordinate to choose a single dot that both players can see, which is challenging because neither knows which dots the other can see.
Each player's world view, w, consists of a circular view on an underlying board containing between 8 and 10 randomly scattered dots, with continuously varying positions, shades, and sizes (Figure 1). Each player's view contains 7 dots, and the views of the players overlap so that there are between 4 and 6 dots which appear in both views.
We focus on a turn-based version of the dialogue task. In a given turn t, a player may communicate with their partner by either sending an utterance u t or selecting a dot s. In the event of selection, the partner is notified but cannot see which dot the player has selected. Once a player has selected a dot, they can no longer send messages. The dialogue ends once both players have selected a dot, and is successful if both selected the same one.
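The turn-taking rules above can be sketched as a minimal game-state class. The class and method names here are our own illustration, not from the released OneCommon codebase:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Player:
    name: str
    selection: Optional[int] = None  # selected dot id, hidden from the partner

@dataclass
class Game:
    players: Tuple[Player, Player]
    transcript: List[Tuple[str, str]] = field(default_factory=list)

    def send(self, player: Player, utterance: str) -> None:
        # once a player has selected a dot, they can no longer send messages
        assert player.selection is None, "no messages after selecting"
        self.transcript.append((player.name, utterance))

    def select(self, player: Player, dot_id: int) -> None:
        # the partner is notified of the selection, but not of which dot
        player.selection = dot_id

    @property
    def done(self) -> bool:
        return all(p.selection is not None for p in self.players)

    @property
    def success(self) -> bool:
        # the dialogue succeeds iff both players selected the same dot
        return self.done and len({p.selection for p in self.players}) == 1
```
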

Model Structure
Our approach is a modular neural dialogue model which factors the agent's generation process into a series of successive subtasks, all centered on grounding language into referents in the world context. In this section, we describe our model structure, which defines a neural module for each subtask. We then describe our reference-centric pragmatic generation procedure in Section 4.
An overview of the relationship between modules in our model is shown in Figure 2. Each module can condition on neural encodings of the context (the world and past dialogue), as well as the outputs of other modules. We describe our system at a high-level here, then give task-specific implementation details about each component in Section 5.

Context Encodings
Our modules can condition on encodings of (i) the past utterances u_{1:t} in the dialogue, represented as a memory vector H_t produced by a word-level recurrent encoder, and (ii) the continuous dots in the world context w, produced by the entity encoding network of Udagawa and Aizawa (2020), which produces a vector w(d) for each dot d encoding the dot's continuous attributes as well as its pairwise attribute relationships to all other dots in the context (Santoro et al., 2017). (i) and (ii) both follow Udagawa and Aizawa (2020). To explicitly encourage the model to retain and use information about the history of referents mentioned by both players, which affects the choice of future referents as well as the selection of a dot at the end of the game, we also use (iii) a structured recurrent referent memory grounded in the context. This memory, inspired by He et al. (2017), has one representation M_t(d) for each dot d in the agent's view, which is updated based on the referents predicted in turn t. See Section 5.4 for details.

Figure 2: In a given turn, an agent first identifies referring expressions in their partner's utterance u_t using the reference detector (1). Each reference is then resolved with the reference resolution module (2), which uses encoded representations z_{1:K_t} of the reference segments and the world context w. The referents are then used to update the referent memory M_t, and cross-referenced against the agent's own dots to confirm whether or not the agent can also see them (3). Given the referent memory M_t and confirmation variable c_{t+1}, the mention prediction module (4) produces a sequence of dot configurations r_{t+1}^{1:K_{t+1}} to mention. Finally, the utterance generation module (5) uses the dialogue history H_t, the confirmation variable, and attended representations of the selected mentions and world context to generate a response u_{t+1}.

Decomposing Turns into Subtasks
We assume turn t+1 in the dialogue has the following generative process (numbers correspond to Figure 2). Steps (1) and (2) identify and resolve referring expressions in the partner's utterance u t ; step (3) updates the memory and determines whether the model can confirm any referents from the partner's utterance; steps (4) and (5) produce the agent's next utterance u t+1 .
(1) First, a sequence of K_t (with K_t ≥ 0) referring expressions is identified in u_t using the reference detector tagging model of Udagawa and Aizawa (2020),^1 and encodings z_t = z_t^{1:K_t} are obtained for them by pooling features from a recurrent utterance encoder.
(2) Then, the referring expressions are grounded. From each referring expression's features z_k, we predict a referent r_k: the set of zero or more dots in the agent's own view which are described by the referring expression. For example, the referring expression three gray dots corresponds to a single referent containing three dots. A reference resolution module P_R(r_t | z_t, w, M), where r_t = r_t^{1:K_t}, predicts a sequence of referents, one for each referring expression.
(3) Given these referents, the agent updates the referent memory M_t using the predicted referents and constructs a discrete confirmation variable c_{t+1}, which indicates whether the agent can confirm in its next utterance that it has all the referents the partner is describing (e.g., Yes, I see that). c_{t+1} takes on one of three values: NA if no referring expressions were in the partner's utterance, YES if all of the partner's referring expressions have referents that are at least partially visible in the agent's view, and NO otherwise.

^1 Udagawa and Aizawa refer to this as a markable detector, given their work's focus on referent annotation.
(4) The agent chooses a sequence of referents to mention next using a mention prediction module P_M(r_{t+1} | c_{t+1}, M_{t+1}, H_t, w).

(5) Finally, the agent produces its next utterance using the utterance generation module P_U(u_{t+1} | r_{t+1}, c_{t+1}, H_t, w), conditioned on the chosen referents (described in the Utterance Generation section below).
At the end of the dialogue (turn T), the agent selects a dot s using a choice selection module P_S(s | H_T, M_T, w) (not shown in Figure 2). Modules that predict referents (reference resolution, mention selection, and choice selection) are all implemented using a structured conditional random field (CRF) architecture (Section 5.2), with independent parameterizations for each module.
Our model bears some similarities to Udagawa and Aizawa (2020)'s neural dialogue model for this task: both models use a reference resolution module and both attend to similar encodings of the dots in the agent's world view (w(d)) when generating language. Crucially, however, our decomposition of generation into subtasks results in a factored, hierarchical generation procedure: our model identifies and then conditions on previously-mentioned referents from the partner's utterances,^4 maintains a structured referent memory updated at each utterance, and explicitly predicts which referents to mention in each of the agent's own utterances. In Section 4, we show how factoring the generation procedure in this way allows us to use a pragmatic generation procedure, and in Section 6 we find that each of these components improves performance.

Pragmatic Generation
The modules as described above can be used to generate the next utterance u_{t+1} using the predictions of P_M(r_{t+1}) and P_U(u_{t+1} | r_{t+1}) (omitting other conditioning variables from the notation for brevity; see Section 3 for the full conditioning contexts). This section describes an improvement, pragmatic generation, to this process. Referents and their expressions should be relevant in the dialogue and world context, but they should also be discriminative (Dale, 1989): allowing the listener to easily understand which referents the speaker intends to describe. Our pragmatic generation approach, based on the Rational Speech Acts (RSA) framework (Frank and Goodman, 2012; Goodman and Frank, 2016), uses the reference resolution module, P_R(r_{t+1} | u_{t+1}), to predict whether the partner can identify the intended referents. This encourages selecting referents that are easy for the partner to identify and describing them informatively in context.^5 We use the following objective over referents r and utterances u for a given turn:

L(r, u) = P_M(r)^{w_M} · P_U(u | r)^{w_S} · P_R(r | u)^{w_L}    (1)

where w_M, w_S, and w_L are hyperparameters.

^4 Udagawa and Aizawa used the reference resolution module only to define an auxiliary loss at training time.
^5 Note that the reference resolution model, which has access to the agent's own view and not the partner's, can only approximate whether the referents are identifiable by the partner; nevertheless we find that it is beneficial for pragmatic generation. Future work might explore also inferring and using the partner's view.
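In log space, an objective of this form is a weighted sum of the three modules' log-probabilities (equivalently, a weighted geometric mean of their probabilities). A minimal sketch, with hypothetical argument names:

```python
import math

def pragmatic_score(log_p_mention, log_p_utterance, log_p_resolve,
                    w_m=1.0, w_s=1.0, w_l=1.0):
    """Weighted log-space combination of scores from the mention prediction,
    utterance generation, and reference resolution modules. The weights are
    hyperparameters trading off relevance against discriminability."""
    return w_m * log_p_mention + w_s * log_p_utterance + w_l * log_p_resolve
```

Candidate (r, u) pairs can then be ranked by this score; increasing w_l penalizes utterances whose referents are hard for the (simulated) listener to resolve.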

Figure 3 (panels: Mention Prediction, Utterance Generation; candidate utterances shown include "Yes, I see that. Let's select the grey one.", "…the right one.", "…the black one.", "…the left one.", and "…the middle."): Agents optimize for a combination of fluency and informativity during pragmatic utterance generation (Section 4 and Algorithm 1). A set of paired candidate referents (from the mention prediction module) and utterances (from the utterance generation module) is rescored using L(r, u) (Equation 1), a weighted geometric mean of scores from the mention prediction, utterance generation, and reference resolution modules. The pair of referent and utterance that maximizes this score is chosen as the response.
This objective generalizes the typical RSA setup (as implemented by the weighted pragmatic inference objective of e.g., Andreas and Klein 2016 and Monroe et al. 2017), which chooses how to describe a given context (i.e., choosing an utterance u), to also choose what context to describe (i.e., choosing the referents r). Our objective also models the tradeoff, explored in past work on referring expression generation (Dale, 1989; Jordan and Walker, 2005; Viethen et al., 2011), between producing utterances relevant in the discourse and world context and producing utterances that are discriminative. We use P_U and P_M to model discourse and world relevance, P_R to model discriminability, and the weights w to empirically model the tradeoff between them.
Given the combinatorially-large spaces of possible r and u, we rely on an early-stopping approximate search, which to our knowledge is novel for RSA. The search (illustrated in Figure 3) iterates through the highest-probability structured referent sequences r under the mention prediction module P_M (Figure 3 shows the top two) and, for each r, samples N_u utterances u from the utterance generation module (Figure 3 shows three u per r). If the maximum of these (r, u) pairs under L exceeds an early-stopping threshold value τ, we return the pair. Otherwise, we continue on to the next r. If more than N_r referent sequences have been evaluated, we return the best (r, u) pair found so far. See Appendix C for pseudocode and a discussion of robustness to the threshold τ.
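The early-stopping search can be sketched as follows; `referent_candidates`, `sample_utterances`, and `score` are hypothetical stand-ins for the mention prediction module P_M, the utterance sampler P_U, and the objective L, not the released implementation:

```python
def pragmatic_search(referent_candidates, sample_utterances, score,
                     tau, n_r, n_u):
    """Early-stopping approximate search over (referent sequence, utterance)
    pairs. `referent_candidates` yields referent sequences r in decreasing
    order of P_M; `sample_utterances(r, n)` draws up to n utterances from the
    utterance generation module; `score(r, u)` is the objective L(r, u).
    Returns the best (score, r, u) triple found."""
    best = None
    for i, r in enumerate(referent_candidates):
        if i >= n_r:          # evaluated the maximum number of sequences
            break
        for u in sample_utterances(r, n_u):
            cand = (score(r, u), r, u)
            if best is None or cand[0] > best[0]:
                best = cand
        if best is not None and best[0] >= tau:
            break             # early stop: a good-enough pair was found
    return best
```
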
Implementation Details

As described so far, our system is applicable to a range of partially-observable grounded collaborative referring expression dialogue tasks (e.g., He et al. 2017; Haber et al. 2019). In this section, we describe implementations of our system's modules, some of which are tailored to ONECOMMON.

Reference Detection
We identify a sequence of referring expressions in the utterance using the reference detector of Udagawa and Aizawa (2020), a BiLSTM-CRF tagger (Huang et al., 2015). Then, following Udagawa and Aizawa (2020), we obtain features z_k for each of the K referring expressions in the utterance (for use in the reference resolution model) with a bidirectional recurrent encoder, using learned weights to pool the encodings at the referring expression's boundaries as well as at the end of the utterance.

Structured Reference Resolution
We use a structured reference resolution module to ground the referring expressions identified above: identifying dots in the agent's own view described by each expression. Grounding referents in this domain involves reasoning not only about attributes of individual dots but also spatially and comparatively within a single referring expression (e.g., a line of three dots) or across referring expressions (e.g., a large grey dot left of a smaller dot).
To predict a sequence of referents r = r_{1:K} from the K referring expression representations z_{1:K} extracted above, we use a linear-chain CRF (Lafferty et al., 2001) with neural potentials to parameterize P_R(r_{1:K} | z_{1:K}, w, M). This architecture generalizes the reference resolution and choice selection models of Udagawa and Aizawa (2020) and Udagawa et al. (2020) to model, in the output structure, relationships between dots, both within and across referring expressions.
There are three different types of potentials, designed to model language-conditioned features of individual dots d in a referent r (φ), relationships within a referent (ψ), and transitions between successive referents (ω). Given these potentials, the distribution is parameterized as

P_R(r_{1:K} | z_{1:K}) ∝ exp( Σ_{k=1}^{K} [ f(r_k, z_k) + ψ(r_k, z_k) ] + Σ_{k=1}^{K−1} ω(r_k, r_{k+1}, z_k, z_{k+1}) )

where f(r, z) = Σ_{d∈r} φ(d, z), and we've omitted the dependence of all terms on M and w for brevity. We share all module parameters across the two subtasks of resolving referents for the agent and for the partner.

Individual Dots. Dot potentials φ(d, z_k) model the correspondence between language features z_k and individual dots represented by encodings w(d), as well as discourse salience, using the dot-level memory M(d) that tracks when the dot d has been mentioned.

Dot Configurations. Configuration potentials ψ(r_k, z_k) model the correspondence between language features and the set of all active dots in the agent's view for a referent r_k. These potentials further decompose into (1) pairwise potentials between active dots in the configuration, which relate the language embedding z_k to attribute differences between dots in the pair (including relative position, size, and shade) and (2) a potential on the entire configuration, which relates the language embedding to an embedding of the count of active dots in the configuration. See Appendix A.1 for more detail.
Configuration Transitions. Transition potentials ω(r_k, r_{k+1}, z_k, z_{k+1}) model the correspondence between language features and relationships between referring expressions, e.g., to the left of in the black dot to the left of the triangle of gray dots. See Appendix A.1 for details.
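Putting the three potential types together, the unnormalized log-score of a candidate referent sequence might be computed as in this sketch; the potential functions here are plain callables standing in for the neural potentials:

```python
def sequence_score(referents, z, phi, psi, omega):
    """Unnormalized log-score of a referent sequence r_{1:K} under the
    linear-chain CRF described above: per-dot potentials `phi` summed over
    the dots in each referent (i.e., f(r_k, z_k)), configuration potentials
    `psi`, and transition potentials `omega` between successive referents.
    `referents` is a list of sets of dot ids; `z` is the list of per-expression
    feature vectors."""
    score = 0.0
    for k, r in enumerate(referents):
        score += sum(phi(d, z[k]) for d in r)   # f(r_k, z_k)
        score += psi(r, z[k])                   # within-referent potential
        if k + 1 < len(referents):
            score += omega(r, referents[k + 1], z[k], z[k + 1])
    return score
```

Normalizing this score over all referent sequences (e.g., with the forward algorithm over the chain structure) would yield the CRF distribution P_R.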

Confirmations
When applied to the partner's utterances, the reference resolution module gives a distribution over which referents the partner is likely to be referring to in the agent's own context. If the agent can identify the referents its partner is describing, it should be able to confirm them, both in the dots it talks about next (e.g., choosing to refer to one of the same dots the partner identified) and in the text of its utterances (e.g., yes, I see it). The discrete-valued confirmation variable (defined in Section 3) models this, taking the value NA if no referring expressions were identified in the partner's utterance, YES if all of the K > 0 referring expressions have a non-empty referent (at least one dot predicted in the agent's context), and NO otherwise.
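The confirmation logic reduces to a small function over the predicted referent sets; this is our own illustrative restatement of the three-way rule above:

```python
def confirmation(referents):
    """Compute the discrete confirmation variable c_{t+1} from the referents
    predicted for the partner's utterance. Each element of `referents` is the
    (possibly empty) set of dots in the agent's view matched to one referring
    expression."""
    if len(referents) == 0:
        return "NA"   # no referring expressions in the partner's utterance
    if all(len(r) > 0 for r in referents):
        return "YES"  # every expression grounded to at least one visible dot
    return "NO"       # some expression could not be grounded
```
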

Referent Memory
The memory state is composed of one state vector M_t(d) for each dot in the agent's own context. These dot states are updated using the referents identified in each utterance. This update is parameterized using a decoder cell, which is applied separately to each dot state:

M_{t+1}(d) = GRU( ι(d, r_t), M_t(d) )

where ι is a function that extracts features from the predictive distribution over referents from the previous utterance, representing mentions of dot d in the referring expressions. We implement the cell using a GRU (Cho et al., 2014). See Appendix A.2 for more details.
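A sketch of the per-dot update follows; the function name, the toy cell, and the shapes are illustrative stand-ins for the learned GRU cell and feature extractor ι:

```python
import numpy as np

def update_memory(M, iota_features, cell):
    """Apply a recurrent cell separately to each dot's memory state:
    M_{t+1}(d) = cell(iota(d, r_t), M_t(d)). `M` has shape
    (num_dots, hidden); `iota_features[d]` encodes how dot d was mentioned
    in the previous utterance's predicted referents; `cell(x, h)` is any
    recurrent cell (a GRU cell in the paper)."""
    return np.stack([cell(iota_features[d], M[d])
                     for d in range(M.shape[0])])
```
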

Mention Selection
The mention selection subtask requires predicting a sequence of referents to mention in the agent's next utterance, P_M(r_{t+1} | u_{1:t}, M_{t+1}, c_{t+1}, w). To produce these referents, we use the same structured CRF architecture as the reference resolution module P_R. However, we use separate parameters from that module, and in place of the referring-expression inputs z we use a sequence of vectors x_{1:K_{t+1}} produced by a decoder cell RNN_M, implemented using a GRU (Cho et al., 2014). The decoder conditions on the dialogue context representation H_t from the end of the last utterance, a learned vector embedding for the confirmation variable c_{t+1}, and a mean-pooled representation of the memory M_{t+1}. We obtain the number of referents K_{t+1} by predicting at each step k whether to halt, using a linear layer followed by a logistic function applied to x_k.

Choice Selection
To parameterize the choice selection module P_S(s | u_{1:T}, M_T, w), we again reuse the CRF architecture, with parameters independent of the reference resolution and mention selection modules, replacing reference resolution's inputs z_{1:K} with the dialogue context representation H_T from the end of the final utterance in the dialogue. Since only a single dot needs to be identified, we use only the CRF's individual dot potentials φ, removing ψ and ω. This is equivalent to the choice selection model (TSEL) used by Udagawa and Aizawa (2020) if the recurrent memory M_T is removed.

Utterance Generation
The utterance generation module P_U(u_{t+1} | r_{t+1}, c_{t+1}, H_t, w) is a sequence-to-sequence model. The module first encodes the sequence of dot encodings w(d) for dots in the referents r_{t+1}^{1:K_{t+1}} (predicted by the mention selection module) to produce encodings y_{1:K_{t+1}}. Words in the utterance are then produced one at a time by a recurrent decoder whose hidden state is initialized with a function that combines y_{1:K_{t+1}}, the dialogue context H_t, and a learned embedding for the discrete confirmation variable c_{t+1}. The decoder has two attention mechanisms: over (i) the dot encodings w(d), following Udagawa and Aizawa (2020), and (ii) the sequence of encoded referents y_{1:K_{t+1}}. See Appendix A.3 for details.

Experiments
We compare our approach to past systems for the ONECOMMON dataset. While our primary evaluation measures systems' success rate on the full dialogue game when paired with human partners (Section 6.4), we also compare our system to past work, and to ablated versions of our full system, using the automatic evaluations of past work.

Models
We compare our full system (FULL) to ablated versions of it that successively remove: (i) the referent memory, ablating explicit tracking of mentioned referents (F-MEM) and (ii) the structured potentials ψ, ω in the reference resolution and mention selection modules (F-MEM-STRUC), removing explicit modeling of relationships within and across referents. We also compare to a reimplementation of the system of Udagawa and Aizawa (2020), which we found obtained better performance than their reported results in all evaluation conditions due to implementation improvements (see Appendix A.5). We obtain supervision for all components of the systems by training on the referent-annotated corpus of 5,191 successful human-human dialogues collected by Udagawa and Aizawa (2019; 2020). See Appendix A.6 for training details. We train one copy of each model on each of the corpus's 10 cross-validation splits. We report means and standard deviations across the splits' models, except in human evaluations where we use a single model.

Corpus Evaluation
Following Udagawa and Aizawa (2020), we evaluate models' accuracy at (1) predicting the dot chosen at the end of the game (Choice Acc.) using P_S and (2) resolving the referents in utterances from the human partner in the dialogue who had the agent's view (Ref. Resolution dot-level accuracy, Acc., and exact match accuracy, Ex.) using P_R.
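The two referent-resolution metrics can be restated as a short scoring function; this is our own illustration of dot-level vs. exact-match accuracy, not the official evaluation script:

```python
def referent_metrics(pred, gold, num_dots=7):
    """Compare predicted and gold referents (each a set of dot ids, one set
    per referring expression). Dot-level accuracy scores each dot's binary
    membership independently; exact match requires the whole set of dots for
    a referring expression to be predicted correctly."""
    dot_correct = dot_total = exact = 0
    for p, g in zip(pred, gold):
        for d in range(num_dots):
            dot_correct += (d in p) == (d in g)
            dot_total += 1
        exact += (p == g)
    return dot_correct / dot_total, exact / len(pred)
```

Note that dot-level accuracy is much more forgiving: a prediction missing one dot of a three-dot referent still scores well per-dot but fails exact match.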
We see in Table 1 that our FULL model improves substantially on past work, including the work of Udagawa et al. (2020), who augment their referent resolution model with numeric features. Our structured reference resolver is able to learn these features in its potentials ψ (in addition to other structured relationships), and improves exact match from 44% to 76% compared to the ablated version of our system. Our recurrent memory helps in particular for the choice selection task, improving from 71% to 83% accuracy.
We also compare the performance of our full and ablated systems on the tasks of resolving the partner's referring expressions and mention prediction, with results given in Appendix B.

Evaluation in Self-Play
To evaluate systems on the full dialogue task, we first use self-play, where a system is partnered with a copy of itself, following Udagawa and Aizawa (2020). We evaluate systems on 3,000 world contexts, stratified into contexts with 4, 5, and 6 dots overlapping between the two agents' views, with 1,000 contexts in each stratum. Table 2 reports task success (the fraction of times both agents chose the same dot at the end of the dialogue) averaged across the 10 copies of each model trained on the cross-validation splits. As in the corpus evaluation, we see substantial improvements to our system from the structured referent prediction and the recurrent reference memory. Our FULL system, without pragmatic generation, improves over the system of Udagawa and Aizawa (2020) from 51% to 58% in the hardest setting, with a further improvement to 62% when adding our pragmatic generation procedure.

Human Evaluation
Finally, we perform human evaluation by comparing system performance when playing with workers from Amazon's Mechanical Turk (MTurk). To conduct the evaluation, we used 100 world states from the #Shared=4 partition, and collected 718 complete dialogues by randomly pairing each worker with one of the following three partners: our best-performing model in self-play (FULL+PRAG), the model from Udagawa and Aizawa (2020), or another worker.
In order to ensure higher quality dialogues, and following Udagawa and Aizawa (2019), we filtered workers by qualifications, showed workers a game tutorial before playing, and prevented dots from being selected within the first minute of the game. We compare systems based on the percentage of successful dialogues. The results, in Figure 4, corroborate the trends observed in self-play. Both the model of U&A (2020) and our FULL+PRAG perform worse against humans than against agent partners in the automatic self-play evaluation, illustrating the importance of performing human evaluations. However, the trend is preserved, and we see that the FULL+PRAG system substantially outperforms the U&A (2020) model, resulting in a 50% relative improvement in task success rate. This difference is statistically significant at the p ≤ 0.05 level using a one-tailed t-test.

Success by Human Skill Level
In Section 6.4, we compared our systems to a human population of MTurk workers. However, human populations themselves vary greatly based on many factors, including the day and time workers are recruited, training and feedback given to workers, and worker retention. One difference between our worker population and the population that produced the dataset is training. When collecting the dataset, Udagawa and Aizawa (2019) performed manual and individualized coaching of their MTurk workers which made them more effective at the game: giving players personalized feedback on how to improve their game strategies, e.g., "please ask more clarification questions."^9 Manual coaching produced a high-quality corpus by increasing players' skill and obtained a success rate of 66%; however, coaching would make human evaluations difficult to replicate across works due to the labor, cost, and variability that coaching involves.

Figure 5: Success rates of human players against each system type, and against other humans, with progressive filtering of humans by their overall success rate (across all conditions) along the x-axis. Shaded regions give standard errors. Our FULL+PRAG system outperforms past work (U&A 2020) at all levels. (Legend: Human–Human, Human–FULL+PRAG, Human–U&A'20.)

^9 Udagawa and Aizawa also manually removed around 1% of dialogues where workers did not follow instructions. While we do not perform post-hoc manual filtering of the dialogues, in order to avoid introducing systematic bias that would favor or disfavor one of the systems we compare, an inspection of a subset of our collected dialogues indicates that a similarly high fraction of our workers were making a good effort at the task.
In this section, we run a sweep of system comparisons of the form of Section 6.4, but on increasingly select sub-populations of MTurk workers. Results are shown in Figure 5. The x-axis gives the minimum skill percentile for a worker's games to be retained (with a worker's skill defined as their average success across all games; see Appendix D for an alternative), so that the far left of the graph shows all workers (corresponding to the numbers in Figure 4), the far right shows only those workers who won all of their games, and the black vertical line marks the player filtering needed to obtain a human-human success rate comparable to Udagawa and Aizawa (2019). Our FULL+PRAG system outperforms the model of Udagawa and Aizawa (2020) at all player skill levels. This result shows that, while more accomplished workers' overall success rates can be much higher than the success rate of our general worker population, in all cases the ordering between the two systems remained the same.
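The progressive filtering in Figure 5 amounts to recomputing success rates over workers above a skill percentile; the following is an illustrative reimplementation, not the released analysis code:

```python
def success_by_skill(worker_games, min_percentile):
    """Success rate over games whose worker's overall success rate is at or
    above the given percentile of the worker pool. `worker_games` maps each
    worker id to a list of 0/1 game outcomes; a worker's skill is their
    average success across all of their games."""
    rates = {w: sum(g) / len(g) for w, g in worker_games.items()}
    sorted_rates = sorted(rates.values())
    # cutoff: the skill value at the requested percentile of workers
    k = min(int(min_percentile / 100 * len(sorted_rates)),
            len(sorted_rates) - 1)
    cutoff = sorted_rates[k]
    kept = [o for w, gs in worker_games.items() if rates[w] >= cutoff
            for o in gs]
    return sum(kept) / len(kept)
```
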

Related Work
Goal-oriented dialog. The modular approach that we use reflects the pipelined approach often used in goal-oriented dialogue systems (Young et al., 2013). Recent work on neural systems has also used structured and memory-based approaches (Bordes et al., 2017;He et al., 2018) including tracking entities identified in text (Williams et al., 2017;He et al., 2017). We also find improvements from an entity-centric approach with a structured memory, although our domain involves more challenging entity resolution and generation due to the spatial grounding.
Referring expressions. A long line of past work on referring expression grounding has tackled generation (Dale, 1989;Dale and Reiter, 1995;Viethen et al., 2011;Krahmer and van Deemter, 2012), interpretation (Schlangen et al., 2009;Liu et al., 2013;Kennington and Schlangen, 2015) or both (Heeman, 1991;Mao et al., 2016;Yu et al., 2017). Closest to ours is the work of Takmaz et al. (2020), which builds models for reference interpretation and generation in the rich PhotoBook corpus (Haber et al., 2019), focusing on a non-interactive setting with static evaluation on reference chains extracted from human-human dialogues.
Collaborative games. The closest work on dialogue systems for collaborative grounded tasks has focused on tasks with different properties from ours, as discussed in Section 1. A task closely related to the shared visual reference game we pursue here is the PhotoBook task (Haber et al., 2019), although a dialogue system has not been constructed for it. Other work on grounded collaborative language games includes collection games.

Pragmatics. Our approach to pragmatics (Grice, 1975) builds on a large body of work in the RSA framework (Frank and Goodman, 2012; Goodman and Frank, 2016), which models how speakers and listeners reason about each other to communicate successfully. The most similar applications to ours in past work on computational pragmatics have been to single-turn grounded reference tasks (rather than dialogue), with much smaller and unstructured spaces of referents than ours,^11 such as discriminative image captioning (Vedantam et al., 2017; Andreas and Klein, 2016; Cohn-Gordon et al., 2018) and referent identification (Monroe et al., 2017; McDowell and Goodman, 2019; White et al., 2020). Explicit speaker-listener models of pragmatics have also been used for dialogue, and while these approaches plan or infer across multiple turns (which our work does not do explicitly), they have either involved ungrounded settings (Kim et al., 2020) or constrained language (Vogel et al., 2013; Khani et al., 2018).

^11 Our setting has 2^7 possible referents for each referring expression in the dialogue.

Conclusion
We presented a modular, reference-centric approach to a challenging partially-observable grounded collaborative dialogue task. Our approach is centered around a structured referent grounding module, which we use (1) to interpret a partner's utterances and (2) to enable a pragmatic generation procedure that encourages the agent's utterances to be interpretable in context. We perform, for the first time, human evaluations on the full dialogue task, finding that our system cooperates with people substantially more successfully than a system from past work and, in aggregate, achieves a success rate comparable to pairs of human partners.
While our results are encouraging, there is still much room for improving all systems in their interactions with people on this challenging task.
As the examples in Appendix E illustrate, people use sophisticated conversational strategies to build common ground (Clark and Wilkes-Gibbs, 1986;Traum, 1994;Clark, 1996) when they interact with each other, producing utterances that play multiple conversational roles and performing complex reasoning. To better plan utterances (Cohen and Perrault, 1979) and more accurately infer the partner's state (Allen and Perrault, 1980), we suspect it will be helpful to extend the single-step pragmatic utterance planning and implicit inference procedures that we use here: planning over longer time horizons, performing more explicit reasoning under uncertainty, and learning richer models of the full range of speech acts that people use. Future work might continue to explore these directions on this task and other similarly challenging tests of collaborative grounding.

A Model Details
A.1 Structured CRF

Dot Configurations. Dot configuration potentials ψ(r, z) are composed of two terms: R(r, z), which decomposes into functions of pairwise relationships between the dots (whether active or not) in the context w and the text features z, and A(r, z), which is a function of all active dots in the referent:

ψ(r, z) = R(r, z) + A(r, z)

The pairwise relationships are

R(r, z) = Σ_{i=1}^{N} Σ_{j≠i} α(r, z, i, j)

where N is the number of dots in view (7) and α is a scalar-valued neural function of the text features and whether the dots indexed by i and j are active in the referent r:

α(r, z, i, j) = p_0 if r(i) ∧ r(j); p_1 if ¬r(i) ∧ ¬r(j); p_2 otherwise

where p (short for p(z, i, j)) is a 3-dimensional vector produced by an MLP. The active dot potential A is designed to model group properties such as cardinality and common attributes, which other work has found useful on this and similar tasks (Tenbrink and Moratz, 2003; Udagawa et al., 2020). We define A(r, z) as a neural function of the text features z, the feature centroid w̄(r), and a count embedding e(r), where w̄(r) is the mean of the feature values for the active dots in r, (1/|r_active|) Σ_{d∈r_active} w(d), and e(r) is a learned 40-dimensional embedding for the discrete count of active dots in r.
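The pairwise potential can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the three-way piecewise form of α is assumed to mirror the β potential defined for configuration transitions below, and the dict-of-vectors interface stands in for the MLP output p(z, i, j).

```python
import itertools

# Hypothetical sketch of the pairwise term R(r, z): for every ordered pair of
# dots (i, j), a 3-vector p (in the paper, produced by an MLP over the text
# features z) is indexed by whether dots i and j are both active, both
# inactive, or mixed in the referent r, and the selected entries are summed.
N = 7  # number of dots in view

def pairwise_potential(r, p):
    """r: tuple of N booleans (active dots); p: dict (i, j) -> 3-vector."""
    total = 0.0
    for i, j in itertools.permutations(range(N), 2):
        if r[i] and r[j]:
            total += p[(i, j)][0]      # both dots active in the referent
        elif not r[i] and not r[j]:
            total += p[(i, j)][1]      # both dots inactive
        else:
            total += p[(i, j)][2]      # exactly one dot active
    return total
```

With all p vectors set to [1, 0, 0], an all-active referent scores one point per ordered pair of the 7 dots.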
Configuration Transitions. The configuration transition potential ω(r_{k:k+1}, z_{k:k+1}) is similar to the dot configuration potential above but bridges the dots in referents k and k+1. It is the sum of two terms:

ω(r_{k:k+1}, z_{k:k+1}) = S(r_{k:k+1}, z_{k:k+1}) + B(r_{k:k+1}, z_{k:k+1})

First is S, which decomposes into pairwise relationships between dots across referents r_k and r_{k+1}:

S(r_{k:k+1}, z_{k:k+1}) = Σ_{i,j} β(r_{k:k+1}, z_{k:k+1}, i, j)

β(r_{k:k+1}, z_{k:k+1}, i, j) = q_0 if r_k(i) ∧ r_{k+1}(j); q_1 if ¬r_k(i) ∧ ¬r_{k+1}(j); q_2 otherwise

where q (short for q(z_{k:k+1}, i, j)) is, like p in Dot Configurations, a 3-dimensional vector produced by an MLP. Next is B, which is a function of the feature centroids of the active dots in referents k and k+1. B is defined analogously to A in Dot Configurations, with w̄(r) again giving the mean of the feature values for the active dots in r. We fix B(r_{k:k+1}, z_{k:k+1}) = 0 if |r_k active| > 3 or |r_{k+1} active| > 3, which had little effect on model accuracy but improves memory efficiency, as it substantially reduces the number of group-pairwise relationships that need to be computed.
Inference. We compute the φ, ψ, and ω potential terms by enumerating the 2^7 possible assignments to each r_k, which can be performed efficiently on a GPU. To compute the normalizing constant for the CRF distribution, which sums over all combinations of assignments to these r_k, we use the standard linear-chain dynamic program. In training, we backpropagate through the enumeration and dynamic program steps to pass gradients to the parameters of the potential functions.
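The inference procedure above can be sketched in log space. This is a minimal, CPU-only illustration of the enumeration plus forward recursion; `psi` and `omega` are placeholder potential functions, not the paper's parameterized modules, and the unary term here stands in for the combined φ and ψ potentials.

```python
import itertools
import math

# Sketch of the CRF normalizer from A.1 (Inference): enumerate all 2^7 referent
# assignments per mention, score unary log-potentials, then sum over assignment
# sequences with the standard linear-chain forward recursion.
N_DOTS = 7
STATES = list(itertools.product([0, 1], repeat=N_DOTS))  # 128 assignments

def log_partition(K, psi, omega):
    """K mentions; psi(k, r): unary log-potential; omega(r, r2): transition."""
    # forward[r] = log-sum of scores of all prefixes ending in assignment r
    forward = {r: psi(0, r) for r in STATES}
    for k in range(1, K):
        new = {}
        for r2 in STATES:
            new[r2] = psi(k, r2) + math.log(
                sum(math.exp(forward[r] + omega(r, r2)) for r in STATES))
        forward = new
    return math.log(sum(math.exp(v) for v in forward.values()))
```

With all potentials zero, the partition function simply counts the 128^K assignment sequences, which gives a quick sanity check.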

A.2 Referent Memory
The function ι collapses predicted values for the dot d over K referents into a single representation for the dot, which we do in two ways: by max- and average-pooling predicted values for d across the K referents. We also obtain the prediction values in two ways: by taking the argmax structured prediction from P_R, and by taking the argmax predictions from each dot's marginal distribution. We found that using these "hard" argmax predicted values gave slightly better results in early experiments than using the "soft" probabilities from P_R. In combination, these give four feature values as the output of ι(d, r_t).
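The feature function described above can be sketched directly; the list-of-lists interface is an assumption made for illustration, standing in for the model's tensors.

```python
# Sketch of the referent-memory feature function iota(d, r) from A.2: for a dot
# d, pool its hard predicted inclusion across the K referents of an utterance,
# using both the structured (joint) argmax prediction and the per-dot marginal
# argmax, under both max- and average-pooling: four feature values in total.
def iota(d, structured_preds, marginal_preds):
    """structured_preds, marginal_preds: K lists of 0/1 predictions per dot."""
    s = [r[d] for r in structured_preds]   # dot d under the joint argmax
    m = [r[d] for r in marginal_preds]     # dot d under per-dot marginal argmax
    return [max(s), sum(s) / len(s),       # max- and average-pooled (structured)
            max(m), sum(m) / len(m)]       # max- and average-pooled (marginal)
```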

A.3 Utterance Generation Module
We first use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to encode the sequence of K referents-to-mention r_{t+1} = r_{t+1}^{1:K}, using as input at each position k ∈ [1, K] a mean-pooled representation of the world context embeddings for the active dots in the referent, producing a sequence of encoded vectors y_t^{1:K}. We make gated updates to the decoder's initial state, updating it with (i) a linear projection of the forward and backward vectors for y_t^1 and y_t^K, representing the referent context, and (ii) an embedding for the discrete confirmation variable c_{t+1}.
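The gated state update can be illustrated with a minimal GRU-style sketch. This is an assumption about the form of the gate, not the paper's exact architecture: new information x (the projected referent context together with the confirmation embedding) is mixed into the decoder state h through a learned sigmoid gate.

```python
import numpy as np

# Hypothetical sketch of a gated update to the decoder's initial state (A.3):
# a sigmoid gate, computed from the current state and the update vector,
# interpolates between the old state and a projected update candidate.
def gated_update(h, x, W_g, W_x):
    """h: state vector; x: update vector; W_g, W_x: learned weight matrices."""
    gate = 1.0 / (1.0 + np.exp(-(W_g @ np.concatenate([h, x]))))  # sigmoid gate
    candidate = np.tanh(W_x @ x)              # projected update content
    return gate * candidate + (1 - gate) * h  # convex combination of old/new
```

With zero weights the gate is 0.5 and the candidate is zero, so the state is simply halved, which makes the interpolation easy to verify.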

A.4 Implementation Choices
For our reimplementation of the system of Udagawa and Aizawa (2020) in a shared codebase with our system, we replace their tanh non-linearities with ReLUs and use PyTorch's default initializations for all parameters. These changes improve performance across all evaluation conditions in comparison to the originally reported results.
For our system, we use separate word-level recurrent models, a Reader and a Writer, to summarize the dialogue history. The Reader is bidirectional over each utterance and is used in the reference resolution and choice selection modules. The Writer is unidirectional and is used in the mention selection and utterance generation modules.

Table 3: Accuracies for resolving referents in the partner's view (dot-level accuracy Acc. and exact match Ex.) and predicting the next referents to mention in the dialogue (Next Refs Ex.) in 10-fold cross-validation on the corpus of human-human dialogues. Our FULL system benefits from its recurrent referent memory (outperforming F-MEM) and structured referent prediction module (outperforming F-MEM-STRUC).

A.6 Training Details
For our full system and ablations, we train on each cross-validation fold for 12 epochs using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1 × 10^-3 and early stopping on the fold's validation set. Our loss function is a weighted combination of losses for the subtask objectives, where w_S is a hyperparameter which we set to 1/32 following Udagawa and Aizawa (2020); we omit the conditioning contexts of the probability distributions for brevity (see Section 3.2 for the full contexts). We decay the learning rate when the loss plateaus on validation data. We train models on a Quadro RTX 6000 GPU. Training takes around 1 day for models that use the structured CRF, and several hours without the structured CRF. Self-play evaluation takes around 1 hour.

Table 3 gives performance accuracies for resolving referents in the partner's view (dot-level accuracy Acc. and exact match Ex.) and predicting the next referents to mention in the dialogue (Next Refs Ex.) in 10-fold cross-validation on the corpus of human-human dialogues. We observe improvements from both the recurrent memory (comparing F-MEM to FULL) and the structured referent prediction module (comparing F-MEM-STRUC to FULL) on both tasks.
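The decay-on-plateau learning-rate schedule used in training can be sketched as follows; the `factor` and `patience` values are assumptions for illustration, as the text does not specify them.

```python
# Hypothetical sketch of the "decay learning rate on validation plateau"
# schedule from A.6: if validation loss fails to improve for `patience`
# consecutive epochs, multiply the learning rate by `factor`.
class PlateauDecay:
    def __init__(self, lr=1e-3, factor=0.5, patience=2):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the new lr."""
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0  # improvement: reset
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor                # plateau: decay the lr
                self.bad_epochs = 0
        return self.lr
```

This mirrors the behavior of standard schedulers such as PyTorch's ReduceLROnPlateau.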

C Pragmatic Generation
We give pseudocode for the pragmatic generation procedure (Section 4) in Algorithm 1. Figure 3 shows an example, with 2 referents r (inputs to the realize function) on the left, and 3 utterances u sampled for each referent on the right. Fewer than N_r referent candidates may be evaluated (as in Figure 3) if one (r, u) pair is found with L(r, u) ≥ τ.
We used self-play evaluation on one of the crossvalidation splits to tune the early-stopping threshold τ , selecting from among the values {0.0, 0.6, 0.7, 0.8, 0.9}. The optimal value was τ = 0.8, but the success rate in self-play was fairly robust to the value chosen (including τ = 0.0, which results in performing pragmatic search only over those utterances for the single highest-scoring referent sequence under P M ), with a range of about 2%. We did not evaluate without early-stopping (searching over all candidate reference sequences and utterances) as this would have made generation too computationally expensive to be feasible in both self-play and human evaluations.
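The early-stopping search described above can be sketched as follows. The `sample_utterances` and `listener_score` callables are assumed interfaces standing in for the speaker model's sampler and the listener model L; this is an illustration of the control flow, not the paper's Algorithm 1 verbatim.

```python
# Sketch of the pragmatic generation loop from Appendix C: iterate over
# candidate referent sequences in order of speaker score, sample utterances for
# each, score them with the listener model, and stop early once some (r, u)
# pair clears the threshold tau.
def pragmatic_generate(ranked_referents, sample_utterances, listener_score,
                       tau=0.8, n_samples=3):
    best, best_score = None, float("-inf")
    for r in ranked_referents:              # candidates ranked by P_M
        for u in sample_utterances(r, n_samples):
            score = listener_score(r, u)    # L(r, u): can the partner resolve it?
            if score > best_score:
                best, best_score = (r, u), score
        if best_score >= tau:               # early stopping once tau is cleared
            break
    return best
```

Setting tau = 0.0 recovers the variant mentioned above that searches only over utterances for the single highest-scoring referent sequence.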

D Alternative Skill Analysis
In Section 6.5, we compared systems on increasingly select sub-populations of MTurk workers, selected by their average success across all conditions (whether playing with other humans or one of the two system types). In this section, we run a similar analysis but select workers by their average success when paired with human partners only.

Figure 6: Success rates of human players against each system type, and other humans, with progressive filtering of humans by their overall success rate (when partnered with other humans) along the x-axis. Shaded regions give standard errors. Our FULL+PRAG system outperforms the model from Udagawa and Aizawa (2020) at all levels.

Results are shown in Figure 6. The x-axis gives the minimum skill percentile for a worker's games to be retained, with skill defined by a worker's average success when paired with other human workers. The far left of the graph shows all workers, the far right shows only those workers who won all of their games when paired with other workers, and the black vertical line marks the player filtering needed to obtain a human-human success rate comparable to Udagawa and Aizawa (2019). As we saw in Section 6.5, our FULL+PRAG system outperforms the model of Udagawa and Aizawa (2020) at all worker skill levels. However, focusing on the sub-population of workers who are successful when paired with other humans (the right side of Figure 6) reveals a gap between humans and our system: humans who are successful when partnering with other humans are substantially less successful when partnering with our FULL+PRAG system (and even less successful when partnering with the model of U&A'20). This indicates room for improvement on the task, as we want to build a system that can collaborate as well as humans with any population of human partners.

E Dialogue Examples
We show one successful and one failed dialogue from our human evaluations (Section 6.4) for each system (Figure 7) and from human-human pairs (Figure 8).
As seen in these examples, descriptions from the baseline system (Figures 7a and 7b) typically have a consistent syntactic structure (e.g., "i have a <size> <color> dot with a <size> <color> dot <spatial relation>") but often do not correspond to the visual context. We suspect that it is difficult for this end-to-end generation model to simultaneously learn which dots to talk about (content selection) and how to describe them (surface realization) with the amount of training data available. Our FULL+PRAG system (Figures 7c and 7d) produces broader and generally more accurate utterances, which we attribute to our factored and pragmatic generation procedure.
Our system's utterances still have substantial qualitative differences from those in human-human dialogues (Figure 8), which-due to the richness of the task (Udagawa and Aizawa, 2019)-often use more complex strategies. Human strategies can unfold across multiple turns, e.g., introducing information in installments or referring to the same dot in multiple turns without being repetitive, as A does when providing more information about the "light grey dot" in Figure 8a. Sophisticated strategies are also used even in single turns, e.g., in Figure 8b, B's utterance "is one on top of the other? if so pick the top one" combines multiple types of speech act (Austin, 1962;Searle, 1976): implicitly acknowledging A's utterance, asserting new information about the dots in view, and issuing a command.