Grounding ‘Grounding’ in NLP

The NLP community has seen substantial recent interest in grounding to facilitate interaction between language technologies and the world. However, as a community, we use the term broadly to reference any linking of text to data or a non-textual modality. In contrast, Cognitive Science more formally defines "grounding" as the process of establishing what mutual information is required for successful communication between two interlocutors, a definition which might implicitly capture the NLP usage but differs in intent and scope. We investigate the gap between these definitions and seek answers to the following questions: (1) What aspects of grounding are missing from NLP tasks? Here we present the dimensions of coordination, purviews and constraints. (2) How is the term "grounding" used in current research? We study the trends in datasets, domains, and tasks introduced in recent NLP conferences. And finally, (3) How can we advance our current definition to bridge the gap with Cognitive Science? We present ways to either create new tasks or repurpose existing ones to make advancements towards achieving a more complete sense of grounding.


Introduction
We as humans communicate and interact for a variety of reasons, each with a goal. We use language to seek and share information, clarify misunderstandings that conflict with our prior knowledge, and contextualize based on the medium of interaction to develop and maintain social relationships. However, language is only one of the enablers of this communication, which relies on several auxiliary signals and sources such as documents, media, and physical context. This linking of concepts to context is grounding, and within NLP the context is often a knowledge base, images or discourse. In contrast, research in cognitive science defines grounding as the process of building a common ground based on shared mutual information in order to successfully communicate (Clark and Carlson, 1982; Krauss and Fussell, 1990; Clark and Brennan, 1991; Lewis, 2008). We argue that this definition subsumes NLP's current working definition and provides concrete guidance on which phenomena are missing to ensure the naturalness and long term utility of our technologies.
In Section 2, we formalize 3 dimensions key to grounding: Coordination, Purviews and Constraints, to systematize our analysis of limitations in current work. Section 3 presents a comprehensive review of the current progress in the field including the interplay of different domains, modalities, and techniques. This analysis includes understanding when techniques have been specifically designed for a single modality, task, or form of grounding. Finally, Section 4 outlines strategies to repurpose existing datasets and tasks to align with the new richer definition from cognitive science literature. These introspections, re-formulations, and concrete steps situate NLP 'grounding' in larger scientific discourse, to increase its relevance and promise.

Dimensions of grounding
Defining grounding loosely as linking or tethering concepts is insufficient to achieve a more realistic sense of grounding. Figure 1 presents the research dimensions missing from most current work.

Dimension 1: Coordination in grounding
The first and most important dimension that bridges the gap between the two definitions of grounding is coordination, alternatively viewed as the difference between static and dynamic grounding (Fig 2).
Static grounding is the most common type and assumes that the evidence for common ground or the gold truth for grounding is given or attained pseudo-automatically. This is demonstrated in Figure 2 (a). The sequence for this form of interaction includes: (1) human querying the agent, (2) agent querying the data or the knowledge it acquired, (3) agent retrieving and framing a response and (4) agent delivering it to the human. In this setting the common ground is the ground truth KB/data. The human and the agent have common ground by assuming its universality (i.e. no external references). Therefore, successfully grounding the query in this case relies solely on the agent being able to link the query to the data. For instance, in a scenario where a human wants to know the weather report, the accuracy of the database itself is axiomatic and we build a model for the agent to accurately retrieve the queried information in natural language.
Most current research assumes static grounding so progress is measured by the ability of the agent to link more concepts to more data. However, the axiomatic common ground often does not exist and needs to be established in real world scenarios.
Dynamic grounding posits that common ground is built via interactions and clarifications. The mutual information needed to communicate successfully is built via interactions including: requesting and providing clarifications, acknowledging or confirming the clarifications, enacting or demonstrating to receive confirmations, and so forth. This dynamically established grounding guides the rest of the interaction by course-correcting any misunderstandings, as shown in Figure 2 (b). The steps for establishing grounding are part of the interaction, which includes: (1) the human querying the agent, (2) the agent requesting clarification or acknowledging, and (3) the human clarifying or confirming. These three steps loop until a common ground is established. The remaining steps of (4) querying the data, (5) retrieving or framing a response, and (6) delivering the response are the same as in static grounding. The agent and the human may not initially share a common ground, but steps 2 and 3 loop as the conversation progresses to build it. Successfully grounding the query relies not only on the ability of the agent to link the query but also on its ability to construct the common ground from the information mutually shared with the human. Although there are efforts on clarification questioning, the coverage of phenomena is still far from comprehensive (Benotti and Blackburn, 2021b). From the perspective of language acquisition, the cognitive sciences (Carpenter et al., 1998) present two ways of dynamic grounding via joint attention (Koleva et al., 2015): dyadic joint attention and triadic joint attention. In our case, dyadic attention describes the interaction between the human and the agent, where any clarification or confirmation is done strictly between the two of them. Triadic attention also includes a tangible entity along with the human and the agent. The human can provide clarifications by gazing at or pointing to this additional piece in the triad.
Summary: The community should prioritize dynamic grounding as it is more general and more accurately matches real experiences.
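To make the contrast between the two interaction sequences concrete, the following is a minimal sketch rather than an implementation from this work. The `Agent` class, the `clarify` callback standing in for the human, and the alias-based `common_ground` dictionary are all hypothetical simplifications: static grounding only links the query to the data, while dynamic grounding loops over steps (2) and (3) until the query can be linked.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Agent:
    """Toy agent (hypothetical); `kb` plays the role of the axiomatic data."""
    kb: Dict[str, str]
    # Aliases agreed upon with the human during the interaction.
    common_ground: Dict[str, str] = field(default_factory=dict)

    def link(self, query: str) -> Optional[str]:
        # Resolve the query through an established alias, then look it up.
        key = self.common_ground.get(query, query)
        return self.kb.get(key)


def static_grounding(agent: Agent, query: str) -> Optional[str]:
    # Query the data and return the response; no clarification is possible.
    return agent.link(query)


def dynamic_grounding(agent: Agent, query: str,
                      clarify: Callable[[str], str],
                      max_turns: int = 3) -> Optional[str]:
    # Loop over clarification turns until the query can be linked
    # or the (assumed) turn budget is exhausted.
    for _ in range(max_turns):
        if agent.link(query) is not None:
            break
        # (2) the agent requests clarification; (3) the human clarifies.
        answer = clarify(f"What do you mean by '{query}'?")
        agent.common_ground[query] = answer
    # Remaining steps: query the data and deliver the response.
    return agent.link(query)
```

In this sketch the common ground persists on the agent, so a term clarified once can be linked in later turns without asking again.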

Dimension 2: Purviews of grounding
Next, we present the different stages behind reaching a common ground, known as purviews. Most of the current approaches and tasks address these purviews individually and independently, while they are often co-dependent in real world scenarios.
Stage 1: Localization: The first stage is the localization of the concept in either the physical or the mental context. This step is idiosyncratic and relates to the ability of the agent alone to localize the concept. These concepts are often also linked in a compositional form. For instance, consider a scenario in which the agent is to locate a 'blue sweater'. The agent needs to understand each of the concepts of 'blue' and 'sweater' individually and then locate the composition of the whole unit. Clark and Krych (2004) from cognitive sciences demonstrate how incremental grounding (Schlangen and Skantze, 2009; DeVault and Traum, 2013; Eshghi et al., 2015) is performed with these compositions and show how recognition and interpretation of fragments help in this by breaking down instructions into simpler ones. This localization occurs at the word, phrase and even sentence level in the language modality and at the pixel, object and scene level in the visual modality.
Stage 2: External Knowledge: After localizing the concept, the next step is to ensure consistency of the current context of the concept with existing knowledge. Often, the references of grounding either match or contradict the references from our prior knowledge and external knowledge. This might lead to misunderstandings in the subsequent rounds of communication. Hence, in addition to localizing the concept, it is also essential to make the concept and its attributes consistent with the available knowledge sources. Most of the current research is focused on localizing, with few efforts towards extending it to maintain consistency of the grounded concept with other knowledge sources.
Stage 3: Common sense: After establishing consistency of the concept, a human-like interaction additionally calls for grounding the common sense associated with the concept in that scenario. In addition to the basic level of practical knowledge that concerns day to day scenarios (Sap et al., 2020), the concept should also be reasoned about based on that particular context. This contextual common sense moves the idiosyncratic sense towards a sense of collective understanding. For instance, if the human feels cold and asks the agent to get a blue coat, the agent needs to understand that the coat in this instance is a sweater coat and not a formal coat. This implicit common sense minimizes the effort in building a common ground by reducing the articulation of meticulous details. Therefore it is essential to incorporate this explicitly in our modeling as well.
Stage 4: Personalized consensus: As a part of the evolving conversations, the references in the language evolve as well. The grounded term might have a different meaning for an agent with access to the conversation history than for a fresh agent without access to that history. This multi-instance, multi-turn process to achieve consensus makes this a collective or shared stage that continually adapts to personalization, leading to better engagement (Bohus and Horvitz, 2014). In such settings, it is sufficient that the human and the agent are in consensus on the truth value of the grounded term, which need not be the same as the ground truth. This shift in the truth value of the meanings of the grounded terms often arises from developing short-cuts for ease of communication and personalization, which is an acceptable shift as long as the communication is successful.
Summary: Common ground requires expanding to verticals of local, general, common-sense and personalized contextual knowledge.

Dimension 3: Constraints of grounding
The medium and mode of communication constrain communicative goals in practical scenarios. The number and availability of such media have increased and facilitated ubiquitous communication around the world, presenting a diversity in the mode of interaction. Motivated by this, we resurface and adapt the constraints of grounding with respect to media of interaction as defined by Clark and Brennan (1991). Here are the definitions of these constraints in the context of grounded language processing, along with a categorization of the representative domains in grounding that satisfy each constraint.
• Copresence: Agent and human share the same physical environment of the data. Most of the current research in the category of embodied agents satisfies this constraint.
• Visibility: The data is visible to the agent and/or human. The domains of images, images & speech, videos, and embodied agents satisfy this constraint.
• Audibility: Agent and human communicate by speaking about the data. Domains like speech, spoken image captions and videos satisfy this.
• Cotemporality: The agent/human receives at roughly the same time as the human/agent produces. The lag in domains like conversations or interactive embodied agents is considered negligible, so they satisfy this constraint.
• Simultaneity: The agent and the human can send and receive simultaneously. Most media are cotemporal but do not engage in simultaneous interaction. This often disrupts the understanding of the current utterance and the participant may have to repeat it to avoid misunderstandings, which is commonly observed in real world scenarios.
• Sequentiality: The turn order of the agent and the human cannot get out of sequence. Face-to-face conversations usually follow this constraint, but an email thread with active participants and the comments sections in online portals (such as YouTube, Twitch etc.) do not necessarily follow a sequence. In such cases a reply to a message may be separated from it by an arbitrary number of irrelevant messages. These settings are understudied but commonly observed online.
• Reviewability: The agent reviews the common ground with the human to accommodate imperfect human memory. For instance, we reiterate full references instead of using short-cut references when the conversation resurfaces after a while. This develops a personalized adaptation between the interlocutors based on the media to enable ease of communication.
• Revisability: The agent and the human can index a specific utterance in the conversation sequence and revise it, thereby changing the course of the subsequent interaction. Human errors are only natural in a conversation and the agent needs to be ready to rectify the previously grounded understanding.
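The categorization above can be viewed as a small lookup structure. The sketch below is illustrative only: the `Constraint` flag names and the `DOMAIN_CONSTRAINTS` table are hypothetical, and the table encodes only the domain-to-constraint statements made in the list above, not the full analysis of Clark and Brennan (1991).

```python
from enum import Flag, auto
from typing import Iterable, List


class Constraint(Flag):
    COPRESENCE = auto()
    VISIBILITY = auto()
    AUDIBILITY = auto()
    COTEMPORALITY = auto()
    SIMULTANEITY = auto()
    SEQUENTIALITY = auto()
    REVIEWABILITY = auto()
    REVISABILITY = auto()


# Domain -> constraints, restricted to what the list above states explicitly.
DOMAIN_CONSTRAINTS = {
    "embodied agents": Constraint.COPRESENCE | Constraint.VISIBILITY,
    "images": Constraint.VISIBILITY,
    "images & speech": Constraint.VISIBILITY,
    "videos": Constraint.VISIBILITY | Constraint.AUDIBILITY,
    "speech": Constraint.AUDIBILITY,
    "spoken image captions": Constraint.AUDIBILITY,
    "conversations": Constraint.COTEMPORALITY,
    "interactive embodied agents": Constraint.COTEMPORALITY,
}


def uncovered(domains: Iterable[str]) -> List[Constraint]:
    """Return the constraints that none of the given domains satisfies."""
    covered = Constraint(0)
    for domain in domains:
        covered |= DOMAIN_CONSTRAINTS[domain]
    return [c for c in Constraint if not (c & covered)]
```

Calling `uncovered(DOMAIN_CONSTRAINTS)` on this table lists the constraints that no domain named above is said to satisfy, which is one way to surface understudied constraints.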
There has been a good and continual effort in formulating tasks and datasets that satisfy the constraints of visibility, audibility and cotemporality. Contemporary efforts also see an increased interest in addressing copresence in grounded contexts. Very recently, Benotti and Blackburn (2021a) highlighted the importance of recovering from mistakes while establishing the collaborative nature of grounding, contributing to revisability.
Summary: Key to progress is a focus on what is largely a blind spot in grounding: simultaneity, sequentiality and revisability, the last of which enables recovery from mistakes.

Grounding 'Grounding'
Having covered a more formal definition of grounding adapted to NLP, we turn our attention to cataloging the precise usage of 'grounding' in our research community. We present an analysis of the various domains and techniques NLP has explored.

Data and Annotations
Since our aim is to investigate how the community understands the loosely defined term 'grounding', we subselected all the papers that mention terms for 'grounding' in the title or abstract from the S2ORC data (Lo et al., 2020) between the years 1980-2020. In this way, we grounded the term 'grounding' in literature to collect the relevant papers. We acknowledge that the papers analyzed here are not exhaustive with respect to the concept of 'grounding'.
Each paper is annotated with answers to the following questions: (i)
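The selection step can be sketched as a simple filter. This is an illustration under stated assumptions, not the exact procedure used here: the record fields `title`, `abstract` and `year` are hypothetical stand-ins for the corpus metadata, and the regular expression is only one possible reading of "terms for 'grounding'".

```python
import re
from typing import Any, Iterable, Iterator, Mapping

# Assumed keyword pattern: matches "grounding" and "grounded" as whole words,
# but not "ground truth" or "background".
GROUNDING_PATTERN = re.compile(r"\bground(?:ing|ed)\b", re.IGNORECASE)


def select_grounding_papers(papers: Iterable[Mapping[str, Any]],
                            start_year: int = 1980,
                            end_year: int = 2020) -> Iterator[Mapping[str, Any]]:
    """Yield papers in the year range whose title or abstract matches."""
    for paper in papers:
        year = paper.get("year")
        if year is None or not (start_year <= year <= end_year):
            continue
        # Missing titles or abstracts are treated as empty strings.
        text = f"{paper.get('title') or ''} {paper.get('abstract') or ''}"
        if GROUNDING_PATTERN.search(text):
            yield paper
```

A keyword filter of this kind trades recall for precision, which is consistent with the acknowledgement that the collected papers are not exhaustive.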

Domains of grounding
Real world contexts we interact with are diverse and can be derived from different modalities such as textual or non-textual, each of which comprises several domains. Our categorization of these is inspired by the constraints of grounding as described in §2.3. Based on this, the modality based categorization includes the following domains:
• Textual modality comprising plain text, entities & events, knowledge bases and knowledge graphs.
• Non-textual modality comprising images, speech, images & speech and videos.
Numerous other domains including numbers and equations, colors, programs, tables, brain activity signals etc. are studied in the context of grounding at a relatively lower scale in comparison to the aforementioned ones. Each of these can further be interacted with along the coordination dimension of grounding from §2.1, which gives rise to settings including conversations, embodied agents and face-to-face interactions.

Approaches to grounding
This section presents a list of approaches tailored to grounding. The obvious solution is to expand the datasets to promote a research platform. The second is to manipulate different representations to link and bring them together. Finally, the learning objective can leverage grounding. The subcategories within each are presented in Figure 3.
Figure 3: Categorical approaches to grounding
1. Expanding datasets / annotations: The first step towards building an ecosystem for research in grounding is to curate the necessary datasets, which is accomplished through expensive human effort, augmenting existing annotations, and automatically deriving annotations with weak supervision.
1a) New datasets: There has been an increase in efforts for curating new datasets with task specific annotations. These are briefly summarized in Table 1 along with their modalities, domains and tasks.
1b) Augment annotations: Existing curated datasets can also be augmented with task specific annotations instead of collecting the data from scratch, which might be more expensive.
• Non-textual Modality: Static grounding here includes using adversarial references to ground visual referring expressions (Akula et al., 2020), narration (Chandu et al., 2019b, 2020a), language learning (Suglia et al., 2020; Jin et al., 2020) etc.
• Textual Modality: Static grounding includes entity slot filling (Bisk et al., 2016).
• Interactive: Though not fully dynamic grounding, some efforts here are amongst tasks like understanding spatial expressions (Udagawa et al., 2020), collaborative drawing (Kim et al., 2019) etc.
1c) Weak supervision: While the above two are based on human efforts, we can also perform weak supervision, using a trained model to derive the automatic soft annotations required for the task.
• Non-Textual Modality: In the visual modality, weak supervision is used in the contexts of automatic object proposals for different tasks like spoken image captioning (Srinivasan et al., 2020), visual semantic role labeling (Silberer and Pinkal, 2018), phrase grounding (Chen et al., 2019), loose

Manipulating representations:
Grounding concepts often involves multiple modalities or representations that are linked. Three major methods to approach this are detailed here.
2a) Fusion and concatenation: Fusion is a very common technique in scenarios involving multiple modalities. In scenarios with a single modality, representations are often concatenated.
• Non-textual modality: Fusion is applied with images for tasks like referring expressions, SRL (Yang et al., 2016) etc. For videos, some tasks are grounding action descriptions, spatio-temporal QA, concept similarity (Kiela and Clark, 2015), mapping events (Fleischman and Roy, 2008) etc.
• Textual Modality: With text, this is similar to concatenating context, for instance performing content transfer by augmenting context.
• Interactive: In a conversational setting, work is explored in reference resolution (Takmaz et al., 2020; Haber et al., 2019), generating engaging responses (Shuster et al., 2020), document grounded response generation (Zhou et al., 2018b), etc.
• Others: Nakano et al. (2003) study face-to-face grounding in instruction giving for agents.
2b) Alignment: An alternative to combining representations is aligning them with one another.
3. Learning Objective: Grounding is often performed to support a more defined end purpose task. We identified 3 ways that are broadly adopted to incorporate grounding in objective functions.
3a) Multitasking and Joint Modeling: The linking formulation of grounding is often used as an auxiliary or dependent task to model another task.
• Non-textual Modality: Multitasking with images is used to perform spoken image captioning (Chrupala, 2019) and grammar induction (Zhao and Titov, 2020). Joint modeling was used in multiresolution language grounding (Koncel-Kedziorski et al., 2014), identifying referring expressions, multimodal MT (Zhou et al., 2018c), video parsing, learning latent semantic annotations (Qin et al., 2018) etc.
• Interactive: In a conversational setting, multitasking is used to compute concept similarity judgements (Silberer and Lapata, 2014), knowledge grounded response generation (Majumder et al., 2020), and grounding language instructions (Hu et al., 2019). Joint modeling is used by Li and Boyer (2015) to address dialog for complex problem solving in computer programs.
3b) Loss Function: It is crucial to utilize an appropriate loss designed for the specific grounding task. The main difference between multitasking and a loss function adaptation is that while multitasking reweights combinations of existing loss functions, novel loss functions are informed by the data/task at hand, adapting to a novel use case.

Analysis of trends
Based on the categories of approaches and the different datasets from §3.3, we present a representative set of analyses that highlight the major avenues for addressing the key missing pieces of work on grounding to advance future research. Figure 4 presents the trends in the development of grounding over the past decade including: specific approaches (a,b) that present new tasks/challenges; world scopes (Bisk et al., 2020) (c) contributing to grounding language in different  (d) contributing to a part of linguistic diversity. We also present hierarchical pie charts in Figure 5 and in the Appendix to analyze the compositions of modalities and domains for these approaches. While we believe our analysis targets several of the most critical dimensions paving the way for future research directions, it is not exhaustive and we welcome suggestions from the community for additional analysis. For example, it would also be interesting to study domain diversity, task formulation/usefulness, etc. in the future.
Trends in datasets expansion: The introduction of new datasets has seen a rapid increase over the years, while there is also a subtle increasing trend in augmenting annotations to the existing datasets, as observed in Figure 4 (a). As we can see from Figure 5 (a), across all the domains, gathering new datasets seems to be more prominent than augmenting them with additional annotations to repurpose the data for a new task. There seems to be a higher emphasis on the expansion of datasets in the non-textual modalities, particularly in the domain of images. A similar rise is not observed in interactive settings, including conversational data and interaction with embodied agents, which are a promising way to bridge the gap towards a real sense of grounding. It is indeed encouraging to see an increasing trend in the efforts for expanding datasets, but what is needed now is to redirect some of these resources to address dynamic grounding in the coordination dimension, which is scarcely studied in existing datasets.
Trends in manipulating representations: From Figure 4 (b), we note that the fusion technique has been and is increasingly becoming popular in grounding through manipulating representations, in comparison to alignment and projection. This is also observed in Figure 5 (b) with the dominance of the non-textual modality. In the context of the textual modality, this technique is equivalent to concatenation of the context or history in a conversation. Projecting onto a common space is the next most popular technique, ahead of alignment. Similarly, we observe that the non-textual modality overwhelmingly occupies the space of manipulating representations, with an exceeding prominence of fusion. Fusion and projecting onto a common space are currently heavily used methodologies to ground within a single purview. They demonstrate a promising direction to manipulate representations across different stages to maintain consistency along the purviews.
Trends in World Scopes: We also study the development of the field based on the definitions of the world scopes presented by Bisk et al. (2020). Based on this, the last decade has seen an increasing dominance of research on world scope 3 (world of sights and sounds). However, this trend is limited to this scope and is not as clear in world scope 4 (world of embodiment and action). An encouraging observation is the focus of the field in the last year on world scope 5 (social world), which is closer to real interactions. We need to accelerate the development of datasets and tasks in world scopes 4 and 5. It is highly recommended to take the dynamic grounding scenario into account in the efforts for shows that research into grounding in multiple languages is still incredibly rare. As noted by Bender (2011), improvements in one language do not necessarily guarantee comparable performance in other languages. The norm for benchmarking large scale tasks still remains anglo-centric and we need serious efforts to shift this trend and identify challenges in grounding across languages. As a first step, a relatively less expensive way to navigate this dearth is to augment the annotations of existing datasets with other languages.

Path Ahead: Towards New Tasks and Repurposing Existing Datasets
We presented the dimensions of grounding that require serious attention to bridge the gap between the definitions in the cognitive sciences and language processing communities in §2. Based on this, we analyzed language processing research in §3 to understand where we stand and where ongoing efforts in grounding fall short. While we strongly advocate for efforts in building new datasets and tasks that consider progress along these dimensions, we believe in a smoother transition towards this goal. Hence we present strategies to repurpose existing resources to maximum utility as we stride towards achieving grounding in a real sense. In this section, we focus on concrete suggestions to improve along each of the dimensions.
Coordination: This is based on simulating interaction for dynamic grounding. As establishing a common ground is not integrated within datasets, we propose an iterative paradigm to explicitly settle on a common ground based on our priors.
The first family of methods to perform this is human-in-the-loop interactions. The traditional methods of data collection do not cater to human feedback or generation. Some recent approaches incorporate human feedback during data collection (Wallace et al., 2019), training (Stiennon et al., 2020), and inference (Hancock et al., 2019). While the feedback in a human-in-the-loop setting can be given via scores, we argue for a natural language feedback loop (Wallace et al., 2019), which resembles human-human grounding via communication.
The second family of methods is inspired by the theory of mind (Gopnik and Wellman, 1992) and iteratively or progressively asks and clarifies to establish a common ground (Roman et al., 2020). de Vries et al. (2017) and Suglia et al. (2020) disambiguate or clarify the referenced object through a series of questions in a guessing game. This iterative paradigm can be related to work that generates clarification questions and answers to incorporate in the task of question answering. This loop of semi-automatic generation of clarifications establishes a common ground. This is also similar in spirit to generating an explanation or a hypothesis for question answering (Latcinnik and Berant, 2020). The process of first generating an explanation acceptable to the human acts as establishing a common ground.
We believe that datasets and tasks along the following 3 directions encourage dynamic grounding: (1) conversational language learning (Chevalier-Boisvert et al., 2019) or acquisition, (2) clarification questioning and ambiguity resolution, and (3) mixed initiative for grounding in conversations (Morbini et al., 2012). What is needed now, and could revolutionize this paradigm, is the development of evaluation strategies to monitor the evolution of the common ground. Such dynamic grounding data helps improve performance and robustness and encourages human trust in these interactive systems.
Purviews: This is based on establishing consistency across stages of grounding with an incremental paradigm. A simple solution is a modular approach where the purviews flow into the next stage after reasonably satisfying the previous stage. The current benchmarking approaches are mostly lateral, i.e., our current strategies collate multiple datasets of a single task to benchmark. This approach implicitly establishes boundaries between the purviews. In contrast, we advocate for a longitudinal approach to benchmarking, i.e., in addition to collating different datasets for a task, we also extend the purviews of the task such that the output from the previous purview flows into the next purview. As an example, consider establishing a longitudinal benchmark for visual dialog: the tasks flow from object detection (stage 1: localization) to knowledge graphs (stage 2: external knowledge) to common sense understanding (stage 3: common sense) to empathetic dialogue (stage 4: personalization) for the same dataset. This helps us dissect which aspects of grounding the model is good and bad at, and thereby understand its weak areas.
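The longitudinal flow described above can be sketched as a staged pipeline. This is a minimal illustration under assumptions, not a proposed benchmark: the `run_longitudinal` helper, the `Stage` tuple layout and the toy lambda stages are hypothetical, and real stages would be trained models with task-specific metrics. Each stage consumes the previous stage's output, and the report ends at the first purview whose check fails.

```python
from typing import Any, Callable, List, Tuple

# (purview name, model for that stage, check on the stage's output)
Stage = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def run_longitudinal(example: Any, stages: List[Stage]) -> List[Tuple[str, bool]]:
    """Run the stages in order, feeding each output into the next stage.

    Returns (purview name, passed) pairs up to and including the first failure.
    """
    report = []
    state = example
    for name, model, check in stages:
        state = model(state)
        passed = check(state)
        report.append((name, passed))
        if not passed:
            break
    return report


# Toy stages built around the blue coat example; all stubs are hypothetical.
STAGES: List[Stage] = [
    ("localization",
     lambda s: {**s, "object": "blue coat"},
     lambda s: "object" in s),
    ("external knowledge",
     lambda s: {**s, "candidates": ["sweater coat", "formal coat"]},
     lambda s: bool(s["candidates"])),
    ("common sense",
     # Deliberately wrong stub: picks the formal coat although the human is cold.
     lambda s: {**s, "choice": s["candidates"][1]},
     lambda s: s["choice"] == "sweater coat"),
    ("personalization", lambda s: s, lambda s: True),
]

report = run_longitudinal({"utterance": "I feel cold, get me a blue coat"}, STAGES)
```

Stopping at the first failing purview is a design choice of this sketch; a benchmark could instead continue with gold inputs to score later stages independently.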
Constraints: With media imposed constraints, there is a need for a paradigm shift in the way these datasets are curated. The optimal way to navigate this problem is curating new datasets that specifically focus on the less studied constraints of simultaneity, sequentiality and revisability. At the heart of revisability in a collaborative dialog is clarification questioning and resolving ambiguities (Boni and Manandhar, 2003; Rao and Daumé III, 2018; Braslavski et al., 2017; Kumar and Black, 2020; Aliannejadi et al., 2020; Benotti and Blackburn, 2021b). However, they are rarely explored and are not systematically standardized across modalities. Transferring knowledge for shared constraints across tasks is a promising way to leverage the existing datasets.
Augment with multilingual annotations: Different languages also bring novel challenges to each of these issues (e.g. pronoun drop dialogue in Japanese, morphological alignments, etc.). However, as observed in §3.4, the increase in expanding datasets is not proportionally reflected in the inclusion of multiple languages. We recommend a relatively less expensive process of translating the datasets for grounding into other languages to kick start this inclusion. The research community has already seen such efforts in image captioning with human annotated German captions in Multi30k (Elliott et al., 2016) extended from Flickr30k (Plummer et al., 2015) and Japanese captions in STAIR (Yoshikawa et al., 2017) based on MS-COCO images (Lin et al., 2014). Instead of using human annotations, some efforts have also been made to use automatic translations, such as the work by Thapliyal and Soricut (2020).

Conclusions
We discussed the missing pieces and dimensions that bridge the gap between the definitions of grounding in Cognitive Sciences and NLP communities. Thereby, we chart out executable actions in steering existing resources along 3 dimensions to achieve a more realistic sense of grounding. Specifically: (1) Static grounding still remains the central tenet for existing tasks and datasets. However, dynamic grounding is key moving forward. (2) Current benchmarking strategies evaluate model generalization. In tandem, we also need to steer towards longitudinal benchmarking to naturally proliferate across purviews of grounding that is closer to human interactions. (3) Constraints imposed by the interaction medium present nuanced categories of communicative goals. While discerning learning from shared constraints, we also urge the community to invest resources on revisability as a way to recover from contextually mistaken groundings. While ruminating on the above phenomena, the challenge of expanding them to multiple languages and domains still persists. We also recommend systematic evaluation of grounding along these dimensions in addition to the existing linking capabilities.

Ethical Considerations
The analytical and ontological discussion here focuses exclusively on the question of grounding and common ground and does not address the harmful biases inherent in these datasets. Further, the common ground for which we are advocating is culturally specific and future work that introduces tasks and data for these purposes must be explicit about who they serve (culturally and linguistically).
Static Grounding: In static grounding, when you ask an agent "Can you place the dragon fruit on the rack?", the agent links the entities and places the dragon fruit on the rack. The challenge here is mainly the linking part, which is crucial to ensure that the agent accurately understood the instruction.

References
Dynamic Grounding: The same is not true for dynamic grounding. There are primarily two ways in which this materializes. The first concerns language learning: what if the agent does not know what a dragon fruit is? The agent first needs to ask "What is a dragon fruit?", and the human provides an answer. Let's say the human responds by describing physical attributes, such as a reddish-pink fruit, and/or by giving a spatial reference, such as the fruit on the bottom left. The important aspect here is that the agent asks and learns what a dragon fruit is and uses this knowledge later. The second is ambiguity resolution. Consider a scenario where there are multiple racks. It is very natural for a human to ask which rack is meant in order to resolve the ambiguity. We expect the same from the agent: it should ask a clarifying question to resolve the ambiguity and then place the fruit on the second rack.
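The contrast between static and dynamic grounding can be illustrated with a minimal sketch. Everything in it is a hypothetical illustration rather than a method from the surveyed literature: the class `GroundingAgent`, its methods, and the object identifiers are assumed names, and scene perception is reduced to a dictionary lookup. The sketch shows only the control flow: link when the reference is unique, ask when the name is unknown or ambiguous, and retain what the human teaches.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GroundingAgent:
    # Objects the agent can perceive: name -> candidate object ids (hypothetical).
    scene: dict[str, list[str]]
    # Names taught by the human during the interaction: name -> object ids.
    learned: dict[str, list[str]] = field(default_factory=dict)

    def candidates(self, name: str) -> list[str]:
        """Look up a name among learned names first, then in the scene."""
        if name in self.learned:
            return self.learned[name]
        return self.scene.get(name, [])

    def ground(self, name: str) -> tuple[str, str]:
        """Return ("linked", object_id) or ("ask", question)."""
        found = self.candidates(name)
        if not found:
            # Unknown name: language learning requires asking the human.
            return ("ask", f"What is a {name}?")
        if len(found) > 1:
            # Several referents: ambiguity resolution requires asking.
            return ("ask", f"Which {name} do you mean?")
        # Exactly one referent: this is the static-grounding case.
        return ("linked", found[0])

    def learn(self, name: str, object_ids: list[str]) -> None:
        """Store the human's explanation so later references resolve directly."""
        self.learned[name] = list(object_ids)

    def choose(self, name: str, index: int) -> tuple[str, str]:
        """Apply the human's answer to a clarifying question (0-based index)."""
        return ("linked", self.candidates(name)[index])


agent = GroundingAgent(
    scene={"fruit on the bottom left": ["fruit_3"], "rack": ["rack_1", "rack_2"]}
)
first = agent.ground("dragon fruit")   # unknown name, so the agent asks
agent.learn("dragon fruit", agent.candidates("fruit on the bottom left"))
second = agent.ground("dragon fruit")  # resolves through the learned name
rack_question = agent.ground("rack")   # two racks, so the agent asks
rack = agent.choose("rack", 1)         # the human answers "the second rack"
```

A purely static system would implement only the final branch of `ground`. The two "ask" branches and the `learned` store are what make the interaction dynamic in the sense described above.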
Purviews - Localization: Consider this example of a conversation between an agent and a human.
Human: What is the name of the role Robert Downey Jr played in Avengers?
Agent: He played the role of Tony Stark, and sometimes is also referred to as Iron Man.
The agent begins by localizing and linking Robert Downey Jr to Tony Stark and Iron Man to provide the appropriate answer to the query.
Purviews - External Knowledge: However, natural conversations also extend beyond the purview of localization to a broader scope involving external knowledge of the context, including entities, actions, etc. For example, consider this conversation, which is a natural continuation of the earlier one.
Human: Is he the head of SHIELD?
Agent: Tony Stark has never been the head of SHIELD in the movies but has been the acting head upon Maria Hill's suggestion in the comics.
Once we have localized Tony Stark, asking for additional information, such as whether he is the head of SHIELD, is natural in conversation. However, the required external knowledge is rarely present in datasets, and access to it is rarely evaluated. Here, we need to refer to external sources spanning movies and comics to conclude that he has been the acting head in the comics but never in the movies.
Purviews - Common sense: One natural progression of this context extends to the following turns:
Human: How long was the contract between Tony Stark and Marvel?
Agent: Tony Stark is the name of the character in Marvel. Would you like to know the contract length for Robert Downey Jr who played the role?
Here, the agent needs to understand that Tony Stark is not a real person but a character in Marvel. Hence, any contract is with the actor who played the role, not with the character. The agent needs the common sense to understand this and clarify the question.
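One way to picture this behaviour is as a type check on the entity before answering. The following is a minimal sketch under stated assumptions, not a description of any existing system: the dictionaries `ENTITY_TYPE` and `PLAYED_BY`, the set `PERSON_ONLY`, and the function `plan_response` are hypothetical names, and the hand-written facts stand in for the common-sense knowledge a real agent would need to acquire.

```python
from __future__ import annotations

# Toy facts taken from the running example; the type labels are assumptions.
ENTITY_TYPE = {"Tony Stark": "character", "Robert Downey Jr": "person"}
PLAYED_BY = {"Tony Stark": "Robert Downey Jr"}
# Relations that only apply to real people (an assumption of this sketch).
PERSON_ONLY = {"contract length"}


def plan_response(entity: str, relation: str) -> tuple[str, str]:
    """Return ("answer", entity) or ("clarify", question)."""
    is_character = ENTITY_TYPE.get(entity) == "character"
    if relation in PERSON_ONLY and is_character:
        actor = PLAYED_BY.get(entity)
        if actor is None:
            # No known actor to redirect to, so ask an open question.
            return ("clarify", f"{entity} is a character. Whom do you mean?")
        question = (
            f"{entity} is a character. Would you like to know the "
            f"{relation} for {actor}, who played the role?"
        )
        return ("clarify", question)
    # The relation is compatible with the entity, so answer directly.
    return ("answer", entity)


decision = plan_response("Tony Stark", "contract length")
```

The point of the sketch is the ordering of the steps: the mismatch between the relation and the entity type is detected first, and the agent responds with a clarifying question rather than a fabricated answer.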
Purviews - Personalization: After a continuous exchange on this topic (and perhaps a few earlier ones), the agent needs to adapt and personalize to the interacting human over time.
Human: Can you give me any movie suggestions?
Agent: Yes, since you like Disney movies and seem interested in Robert Downey Jr, would you like to watch "Dolittle"?
Having discussed Robert Downey Jr in prior contexts and retained from prior interactions that the human likes Disney movies, the agent, when asked for a movie recommendation, draws on what it has continually learned and contextually suggests Robert Downey Jr's Disney movie "Dolittle".

Constraints - Copresence:
Modality is an important medium that affects communicative goals and the nature of interaction. Here is an example in a copresent environment.
Human: I want to play with my cat. Can you get me the ball on your right?
In the above example, the human and the agent are copresent in the same environment. The above utterance, for instance, includes executable actions in the environment along with references that are either person-centric or agent-centric.

Table 2: Constraints of grounding along with their medium of communication (Clark and Brennan, 1991).

Constraints - Visibility: Certain forms of communication, as in visual question answering or visual dialog, present only a visible medium to interact about. The interaction requires information from an image or a video but does not necessarily include executable actions or cater to external knowledge of the information. For example, with access to an image, a human can ask a question like the following:
Human: How many peaks are there in those mountain ranges?
Constraints - Audibility: This modality constrains the information scope to speech signals that are only heard and do not contain any visual or copresent information. Table 2 presents the constraints of grounding.

B Further survey and categories
Here is a brief elaboration of the datasets presented in Table 1.
New datasets: The first solution is to curate the entire dataset with annotations designed for the task.  (Zhou et al., 2018b), improvisation (Cho and May, 2020), problem solving (Li and Boyer, 2015), spatial reasoning in a simulated environment (Jänner et al., 2018), navigation (Ku et al., 2020), etc. In addition, there are several other techniques used to ground phenomena in real-world contexts.
Here is the technique-wise representation of these categories of models in the literature.

C Prevalence of modalities and constraints
Here is the distribution of the papers studying various tasks based on the constraints imposed by the medium. As we can see, a major concentration of these efforts lies in grounding visual and textual media, while a few cater to audibility, i.e., speech signals. Papers studying dialog are the main representatives of the constraints of sequentiality and co-temporality.

D Nuanced modeling variations for grounding
Here is a more nuanced and finer-grained categorization of the various modeling techniques used in the literature for grounding. Figure 8 presents these categories in depth. As discussed in the paper, most of the literature is focused on grounding in the static visual modality. As these trends show, attention-based methods dominate in both textual and non-textual modalities, closely followed by graph-based methods. This is not an exhaustive study of all the techniques used for grounding, but these are some of the representative categories. Further studies perform grounding with various techniques such as clustering (Shutova et al., 2015; Cardenas et al., 2019), regularization (Shrestha et al., 2020), CRFs (Gao et al., 2016), classification (Pangburn et al., 2003; Monroe et al., 2017), linguistic theories (Strube and Hahn, 1999), iterative refinement (Li et al., 2019; Chandu and Black, 2020), language modeling (Spithourakis et al., 2016; Cho and May, 2020), nearest neighbors (Kiela et al., 2015), contextual fusion (Chandu et al., 2019a), mutual information (Oates, 2003), cycle consistency (Zhong et al., 2020), etc.