Conversational Semantic Parsing using Dynamic Context Graphs

In this paper we consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities and thousands of relation types. We focus on models which are capable of interactively mapping user utterances into executable logical forms (e.g., SPARQL) in the context of the conversational history. Our key idea is to represent information about an utterance and its context via a subgraph which is created dynamically, i.e., the number of nodes varies per utterance. Rather than treating the subgraph as a sequence, we exploit its underlying structure and encode it with a graph neural network, which further allows us to represent a large number of (unseen) nodes. Experimental results show that dynamic context modeling is superior to static approaches, delivering performance improvements across the board (i.e., for simple and complex questions). Our results further confirm that modeling the structure of context is better at processing discourse information (i.e., at handling ellipsis and resolving coreference) and longer interactions.


Introduction
General purpose knowledge graphs (KGs), like Wikidata (Vrandečić and Krötzsch, 2014), structure information in a semantic network of entities, attributes, and relationships, allowing machines to tap into a vast knowledge base of facts. Knowledge base question answering (KBQA) is the task of retrieving answers from a KG, given natural language questions. A popular approach to KBQA (see Gu et al. 2022 and the references therein) is based on semantic parsing, which maps questions to logical form queries (e.g., in SPARQL) that return an answer once executed against the KG.
Existing work (e.g., Bogin et al. 2019; Ravishankar et al. 2021; Yin et al. 2021) has mostly focused on answering questions in isolation, whereas we consider the less studied task of conversational semantic parsing. Specifically, our interest lies in building systems capable of interactively mapping user utterances to executable logical forms in the context of previous utterances. Figure 1 shows an example of a user-system interaction, taken from SPICE (Perez-Beltrachini et al., 2023), a recently released conversational semantic parsing dataset.
Each interaction consists of a series of utterances that form a coherent discourse and are translated to executable semantic parses (in this case SPARQL queries). Interpreting each utterance and mapping it to the correct parse needs to be situated in a particular context as the exchange proceeds. To answer the question in utterance 2, the system needs to recall that Mathias Kneissl is still the subject of the conversation; however, the user is no longer interested in who starred in the film but in who directed it. It is also natural for users to omit previously mentioned information (e.g., through ellipsis or coreference), which would have to be resolved to obtain a complete semantic parse.
In addition to challenges arising from processing contextual information, the semantic parsing task itself involves linking entities, types, and predicates to elements in the KG (e.g., Mathias Kneissl to Q3298576) whose topology is often complex, with a large number of nodes. Moreover, unlike relational databases, the schema of an entity is not static but dynamically instantiated (Gu et al., 2022). For example, the entity type person can have hundreds of relations but only a fraction of these will be relevant for a specific utterance. Therefore, to generate faithful queries, we cannot rely on memorization and should instead make use of local schema instantiation. In general, narrowing down the set of entities and relations is critical to parsing utterances requiring complex reasoning (i.e., where numerical and logical operators apply over sets of entities).
Existing work (Perez-Beltrachini et al., 2023) handles the aforementioned challenges by adopting various simplifications and shortcuts. For instance, since it is not feasible to encode the entire KG, only a subgraph relevant to the current utterance is extracted and subsequently linearized and treated as a sequence. Entity type information that is not directly accessible via neighboring subgraphs is obtained through a global lookup (essentially a reverse index of all types in the KG). This solution is computationally expensive, as the lookup is performed practically for every user utterance, and does not scale well (the index would have to be recreated every time the KG changed).
In this paper we propose a modeling approach to conversational semantic parsing which relies on dynamic context graphs. Our key idea is to represent information about an utterance and its context through a dynamically generated subgraph, wherein the number of nodes varies for each utterance. Moreover, rather than treating the subgraph as a sequence, we exploit its underlying structure and encode it with a graph neural network (Scarselli et al., 2009; Gori et al., 2005). To improve generalization, we learn implicit node embeddings by aggregating information from neighboring nodes whose embeddings are in turn initialized through a pretrained model (Devlin et al., 2019). In addition, we introduce context-dependent type linking, based on the entity and its surrounding context, which further helps with type disambiguation.
Experimental evaluation on the SPICE dataset (Perez-Beltrachini et al., 2023) demonstrates that modeling context dynamically is superior to static approaches, improving performance across the board (i.e., for simple and complex questions requiring comparative or quantitative reasoning). Our results further confirm that modeling the structure of context is better at processing discourse information (i.e., at handling ellipsis and resolving coreference) and longer interactions with multiple turns.

Related Work
Previous work on semantic parsing for KBQA (Gu et al., 2022) has focused on mapping stand-alone utterances to logical form queries. Various approaches have been proposed to this effect, which broadly follow three modeling paradigms. Ranking methods first enumerate candidate queries from the KB and then select the query most similar to the utterance as the semantic parse (Ravishankar et al., 2021; Hu et al., 2021; Bhutani et al., 2019). Coarse-to-fine methods (Dong and Lapata, 2018; Ding et al., 2019; Ravishankar et al., 2021) perform semantic parsing in two stages, by first predicting a query sketch and then filling in missing details. Finally, generation methods (Yin et al., 2021) first rank candidate parses and then predict the final parse by conditioning on the utterance and best retrieved logical forms.
Our task is most related to conversational text-to-SQL parsing, as manifested in datasets like SParC (Yu et al., 2019b), CoSQL (Yu et al., 2019a), and ATIS (Dahl et al., 1994; Suhr et al., 2018). SParC and CoSQL cover multiple domains and include multi-turn user and system interactions. These datasets are challenging in requiring generalization to unseen databases, but the conversation length is fairly short and the databases relatively small-scale. ATIS contains utterances paired with SQL queries pertaining to a US flight booking task; it exemplifies several long-range discourse phenomena (Jain and Lapata, 2021), however, it covers a single domain with a simple database schema.
Graph-based methods have been previously employed in semantic parsing primarily to encode the database schema, so as to enable the parser to globally reason about the structure of the output query (Bogin et al., 2019). Other work (Hui et al., 2022) uses relational graph networks to jointly represent the database schema and syntactic dependency information from the questions. In the context of conversational semantic parsing, Cai and Wan (2020) use a graph encoder to model how the elements of the database schema interact with information in preceding context. Along similar lines, Hui et al. (2021) use a graph neural network with a dynamic memory decay mechanism to model the interaction of the database schema and the utterances in context as the conversation proceeds. All these approaches encode the schema of relational databases, which are significantly smaller in size (e.g., number of entities and types of relations) compared to large-scale KGs, where encoding the entire graph in memory is not feasible.
Closest to our work are methods which cast conversational KGQA as a semantic parsing task (Kacupaj et al., 2021; Marion et al., 2021). These approaches build hand-crafted grammars that are not directly executable by a KG engine. Furthermore, they assume the KG can be fully encoded in memory, which may not be feasible in real-world settings. Perez-Beltrachini et al. (2023) develop a parser which is executable with a real KG engine (e.g., Blazegraph) but simplify the task by considering only limited conversation context.

Problem Formulation
Given a general purpose knowledge graph, such as Wikidata, our task is to map user utterances to formal executable queries, SPARQL in our case. We further assume an interactive setting where users converse with the system in natural language and the system responds while taking into account what has already been said (see Figure 1). The system's response is obtained upon executing the query against a graph query engine.
Let G denote the underlying KG and I a single interaction. I consists of a sequence of turns where each turn is represented by ⟨X_t, A_t, Y_t⟩, denoting an utterance-answer-query triple at time t (see blocks in Figure 1). A user utterance X_t is a sequence of tokens ⟨x_1, x_2, . . ., x_|X_t|⟩, where |X_t| is the length of the sequence and each x is a natural language token. Query string Y_t is a sequence of tokens ⟨y_1, y_2, . . ., y_|Y_t|⟩, where |Y_t| is the length of the sequence and each y is either a token from the SPARQL syntax vocabulary (e.g., SELECT, WHERE) or a KG element ∈ G (e.g., Q3298576). Answer A_t at time t is the result of executing Y_t against G. Given the interaction history I[: t − 1] at turn t and current utterance X_t, our goal is to generate Y_t. This involves understanding X_t in the context of X_:t−1, A_:t−1, and G, and learning to generate Y_t based on encoded contextual information.

Model
Our modeling approach combines three components. We first ground named entities in the user utterance to KG entities, and use these linked entities to extract a subgraph that functions as context (Section 4.1). The second component is responsible for type linking in the context of current and previously mentioned named entities (Section 4.2). And finally, our semantic parser learns to map user utterances into SPARQL queries (Section 4.3).

Entity Grounding and Disambiguation
We are interested in grounding user utterance X_t to graph G. Since encoding the entire KG is not feasible, we extract a subgraph from G which is relevant to the current turn. To achieve this, we first perform named entity recognition with an off-the-shelf NER system; in our experiments we use AllenNLP (Gardner et al., 2018). We perform named entity linking through efficient string matching (Aho and Corasick, 1975), unlike Perez-Beltrachini et al. (2023) who deploy an ElasticSearch server for querying an inverted index.
A string can be ambiguous, i.e., link to multiple entities. For example, Rainer Werner Fassbinder can be linked to filmmaker (Q44426) and movie (Q33561976). To deal with ambiguity and to increase recall, Perez-Beltrachini et al. (2023) never commit to a single entity but instead include the top-K matching ones; however, this introduces noise and increases computational cost. Instead, we disambiguate entities based on their popularity in the training set (Shen et al., 2015) and compare the two approaches in Section 6.
Following Perez-Beltrachini et al. (2023), for each identified KG entity e, we extract triples (e, r, o) and (s, r, e), where s and o denote the subject and object of relation r. For instance, entity Dubashi would have the triple (Dubashi, country of origin, India). Subjects and objects are further mapped to their types in place of the actual entities (i.e., (e, r, o_type) and (s_type, r, e)). In our example, the triple for Dubashi then becomes (Dubashi, country of origin, country), where country is type Q6256. We denote the set of typed triples as G^ent_t.
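As a rough illustration, the grounding pipeline can be sketched as follows. The lexicon, popularity counts, and helper names are hypothetical, and plain longest-substring matching stands in for the Aho-Corasick automaton used in practice:

```python
from collections import Counter

# Toy entity lexicon: surface form -> candidate Wikidata IDs (hypothetical data).
LEXICON = {
    "mathias kneissl": ["Q3298576"],
    "rainer werner fassbinder": ["Q44426", "Q33561976"],  # filmmaker vs. movie
}

# Popularity: how often each entity appears in the training set (made-up counts).
POPULARITY = Counter({"Q44426": 120, "Q33561976": 3, "Q3298576": 17})

def link_entities(utterance):
    """Greedy longest-match linking; a stand-in for Aho-Corasick matching."""
    text = utterance.lower()
    linked = []
    for surface, candidates in sorted(LEXICON.items(), key=lambda kv: -len(kv[0])):
        if surface in text:
            # Disambiguate by training-set popularity instead of keeping top-K.
            best = max(candidates, key=lambda e: POPULARITY[e])
            linked.append((surface, best))
            text = text.replace(surface, " ")  # avoid overlapping re-matches
    return linked

def typed_triples(entity, kg, types):
    """Replace subjects/objects of one-hop triples with their KG types."""
    out = []
    for s, r, o in kg:
        if s == entity:
            out.append((s, r, types[o]))
        elif o == entity:
            out.append((types[s], r, o))
    return out
```

For instance, `link_entities("Who starred in Mathias Kneissl ?")` would resolve the mention to Q3298576, and `typed_triples` would rewrite its one-hop neighborhood into the typed form G^ent_t described above.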

Context-Dependent Type Linking
Entities in SPARQL queries have types; for example, in Figure 1, KG element Q502895 is a placeholder for the type "common name". Type instances are often present in the one-hop entity neighborhood G^ent_t, but can also be more hops away. Perez-Beltrachini et al. (2023) index all KG types and perform a global lookup, which is computationally expensive and solely applicable to the KG they are working with. Instead, we perform type linking based on the entities mentioned in the current context. We expand the grounded entities to extract triples with type information. Since considering multi-hop neighborhoods would lead to extremely large subgraphs and would not be memory efficient, we prune these based on their string overlap with the user utterance, significantly reducing the number of triples. The pruned graph G^type_t is merged with the previously obtained entity graph G^ent_t.
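A minimal sketch of the pruning step, assuming a hypothetical `labels` mapping from KG ids to surface labels; the actual overlap criterion may be more sophisticated than token intersection:

```python
def prune_type_triples(type_triples, utterance, labels):
    """Keep only candidate triples whose type label overlaps the user utterance.

    type_triples: iterable of (entity, relation, type_id) from the expanded
    neighborhood; labels maps KG ids to surface labels (hypothetical helper).
    """
    utterance_tokens = set(utterance.lower().split())
    pruned = []
    for s, r, t in type_triples:
        type_tokens = set(labels.get(t, "").lower().split())
        # String overlap between the type label and the utterance.
        if type_tokens & utterance_tokens:
            pruned.append((s, r, t))
    return pruned
```

Given the utterance "Which country is Dubashi from?", a candidate triple typed with country (Q6256) survives, while unrelated types are dropped before the graph is merged into the context.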

Dynamic Context Graph Model
Figure 2 shows the overall architecture of our dynamic context graph model (which we abbreviate to DCG). DCG takes as input a tuple of the form ⟨C_t, X_t, G_t⟩, where X_t is a user utterance at time t and G_t is the corresponding subgraph. C_t denotes the previous context information, which includes the previous utterance X_{t−1}, the previous answer A_{t−1}, and subgraph G_{t−1}. We use Ĝ_t to represent the merged context subgraphs G_t and G_{t−1}, such that Ĝ_t = G_t ∪ G_{t−1}. We encode the context subgraph Ĝ_t with a graph neural network (GNN; Scarselli et al. 2009; Gori et al. 2005), and user utterances and their discourse context with BERT (Devlin et al., 2019). Our decoder is a transformer network (Vaswani et al., 2017) that conditions on the user utterance encoding and the corresponding graph representations.
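The maintenance of merged context subgraphs across turns can be sketched as a sliding window over sets of triples (a simplified stand-in; the class and method names are our own):

```python
from collections import deque

class ContextGraph:
    """Sliding window of per-turn subgraphs, merged into one context graph.

    max_turns plays the role of the context-length hyperparameter t
    (set to 5 in the experiments reported in the paper).
    """
    def __init__(self, max_turns=5):
        self.window = deque(maxlen=max_turns)

    def add_turn(self, triples):
        self.window.append(set(triples))

    def merged(self):
        # The merged graph is the union of the retained per-turn subgraphs,
        # so its number of nodes varies dynamically per utterance.
        out = set()
        for g in self.window:
            out |= g
        return out
```

Because old subgraphs fall out of the window, the encoder never sees more than the last t turns of graph context.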
Utterance Encoder We use BERT (Devlin et al., 2019) to represent the concatenation of previous utterance X_{t−1}, previous answer A_{t−1}, and current utterance X_t (see Figure 2). To distinguish between current and past context we use the [SEP] token. More formally, let X̃_t = X_{t−1} [SEP] A_{t−1} [SEP] X_t, where X_{t−1}, A_{t−1}, and X_t are sequences of natural language tokens. We obtain latent representations Z_t as Z_t = BERT(X̃_t).

Graph Encoder
We represent the KG subgraph Ĝ_t at time t as a directed graph G = (V, E) (hereafter, we simplify notation and drop time t), where V = {v_1, v_2, . . ., v_n} are nodes such that v_i ∈ {entities, relations, types} and E ⊆ V × V.
Each node v_i consists of a sequence of natural language tokens, such that v_i = ⟨v_i1, v_i2, . . ., v_i|v_i|⟩. Our KG has a large number of distinct nodes, and we cannot possibly observe all of them during training. To handle unseen nodes at test time, we obtain a generic node representation h^0_i for node v_i, where h^0_i = AVG(BERT(v_i)). In other words, we compute encoding h^0_i by taking the average of the individual token encodings obtained from BERT. We do not create a separate embedding matrix but directly update the BERT representations during learning, which allows us to scale to a large number of (unseen) nodes.
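The averaging step can be illustrated with plain lists standing in for BERT token vectors (in the actual model these are contextual hidden states that continue to be updated during training):

```python
def node_embedding(token_vectors):
    """Average per-token vectors into a single generic node representation.

    token_vectors stands in for BERT outputs over the node's label tokens;
    here they are plain lists of floats purely for illustration.
    """
    dim = len(token_vectors[0])
    n = len(token_vectors)
    # h^0_i = AVG over tokens, dimension by dimension.
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]
```

Because the representation is composed from token encodings rather than looked up in a fixed embedding table, a node never seen during training still receives a meaningful initial vector.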
A graph neural network (GNN) learns node representations by aggregating information from neighboring nodes. Each GNN layer l takes as input node representations and edges E; the output of each layer is an updated set of node representations {h^l_i | i ∈ [1, n]}. We use Graph Attention Network v2 (GATv2; Brody et al. 2022) for updating node representations, which replaces the static attention mechanism of GAT (Velickovic et al., 2018) with a dynamic and more expressive variant. Let N_i = {v_j ∈ V | (j, i) ∈ E} denote the neighbors of node v_i and α_ij the attention score between nodes h_i and h_j. We calculate attention with a single-layer feedforward neural network, parametrized by a weight vector a and weight matrix W:

α_ij = softmax_j(ψ(h_i, h_j)) = exp(ψ(h_i, h_j)) / Σ_{k∈N_i} exp(ψ(h_i, h_k))    (1)

The scoring function ψ is computed as follows:

ψ(h_i, h_j) = a^T LeakyReLU(W [h_i ∥ h_j])    (2)

where ^T represents transposition and ∥ is the concatenation operation. Attention coefficients corresponding to each node i are then used to compute a linear combination of the features corresponding to neighboring nodes:

h^{l+1}_i = σ( Σ_{j∈N_i} α_ij W h_j )    (3)

Decoder Our decoder is a transformer network (Vaswani et al., 2017). Let H^l_t = ⟨h^l_1t, h^l_2t, . . ., h^l_nt⟩ denote the sequence of node representations from the last layer of the graph network (recall that t here represents an interaction turn). Our decoder models the probability of generating a SPARQL parse conditioned on the graph and input context representations, i.e., p(Y_t | H^l_t, Z_t). Generating the SPARQL parse requires generating syntax symbols (such as SELECT, WHERE) and KG elements (i.e., entities, types, and relations). Given decoder state s_i at the i-th generation step, the probability of generating y_i combines p_G and p_S, the probabilities of generating a graph node and a syntax symbol, respectively. We calculate p_S = softmax(W_s s_i), and ŷ_i denotes the predicted output token at decoder step i.
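The GATv2-style scoring and aggregation described above can be sketched in plain Python for tiny dimensions (illustrative only; the actual model uses PyTorch Geometric's GATv2 layer, and following Brody et al. 2022 we apply the right block of W when aggregating neighbor features):

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0.0 else slope * x

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def gatv2_layer(h, edges, W, a):
    """One GATv2-style update in plain Python.

    h: node feature vectors; edges: (src, dst) pairs; W: weight matrix over
    the concatenated pair [h_i || h_j]; a: attention vector. Scoring follows
    psi(h_i, h_j) = a . LeakyReLU(W [h_i || h_j]); attention is a softmax of
    psi over in-neighbors; nodes with no in-edges get a zero vector here.
    """
    n, d_in = len(h), len(h[0])
    in_nbrs = {i: [s for (s, d) in edges if d == i] for i in range(n)}
    W_r = [row[d_in:] for row in W]  # right block: transforms neighbor h_j
    out = []
    for i in range(n):
        nbrs = in_nbrs[i]
        if not nbrs:
            out.append([0.0] * len(W))
            continue
        scores = []
        for j in nbrs:
            z = matvec(W, h[i] + h[j])  # W [h_i || h_j] (list concatenation)
            scores.append(sum(ai * leaky_relu(zi) for ai, zi in zip(a, z)))
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]  # stable softmax
        alphas = [e / sum(exps) for e in exps]
        agg = [0.0] * len(W)
        for alpha, j in zip(alphas, nbrs):
            wj = matvec(W_r, h[j])
            agg = [g + alpha * w for g, w in zip(agg, wj)]
        out.append(agg)
    return out
```

Stacking two such layers (as in the reported configuration) lets each node aggregate information from its two-hop neighborhood.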

Experimental Setup
Dataset We performed experiments on SPICE (Perez-Beltrachini et al., 2023), a recently released large-scale dataset for conversational semantic parsing built on top of the CSQA benchmark (Saha et al., 2018). SPICE consists of user-system interactions where natural language questions are paired with SPARQL parses, and answers provided by the system correspond to SPARQL execution results (see Figure 1). We present summary statistics of the dataset in Table 1.

Model Configuration Our model is implemented in PyTorch (Paszke et al., 2019) and trained with the AdamW (Loshchilov and Hutter, 2019) optimizer. Model selection was based on exact match accuracy on the validation set. We used two decoder layers and two GATv2 layers for all experiments. We used HuggingFace's pretrained BERT embeddings (Wolf et al., 2020), specifically the uncased base version. Our GATv2 implementation is based on PyTorch Geometric (Fey and Lenssen, 2019) with two attention heads. We stack adjacency matrices diagonally to create mini-batches for our GNN across different examples. We identify named entities using the AllenNLP named entity recognition (NER) system (Gardner et al., 2018). Our execution results are based on the Wikidata subgraph provided by Perez-Beltrachini et al. (2023). Our SPARQL server is deployed using Blazegraph. See Appendix B for more implementation details. As described in Section 4.3, at each utterance our model encodes the previous t subgraphs. Larger context is informative but can also introduce noise. We treat t as a hyperparameter and optimize its value on the development set. We report results with t = 5 (see Appendix A for details).
Comparison Models We compare against the semantic parser of Perez-Beltrachini et al. (2023). Their model is based on BERT (Devlin et al., 2019); it relies on AllenNLP (Gardner et al., 2018) for named entity recognition and performs entity linking with an ElasticSearch inverted index. As mentioned earlier, they do not explicitly perform named entity disambiguation (they consider the K = 5 best matching entities and their neighborhood graphs as part of the vocabulary) and use a global lookup for type linking. As the size of the linearized subgraph often exceeds BERT's maximum number of input tokens (512), they adopt a workaround where the graph is chunked into several subsequences which are encoded separately. We refer to their semantic parser as BertSP_GL, where GL is shorthand for Global Lookup.
In addition to our full dynamic context graph model which performs Context-dependent Type Linking (DCG_CL), we also build a simpler variant (DCG) which relies only on the entity neighborhood subgraph for type information. Moreover, we create two variants of our model, one which disambiguates entities and another which does not (similar to Perez-Beltrachini et al. 2023).

Results
In this section, we evaluate the performance of our semantic parser on the SPICE test set. We report results on individual question types and overall. We also analyze our system's ability to handle different discourse phenomena like ellipsis and coreference, as well as interactions of varying length. We find that context-dependent type linking (DCG_CL) is slightly worse than a global lookup, which is expected given that it does not have access to the full list of KG types.

The Effect of Dynamic Context
In general, we observe that results with exact match (EM) are lower than F1 or Accuracy. EM is a stricter metric: it does not allow for any deviation from the gold-standard SPARQL. However, it is possible for two queries to have different syntax but equivalent meaning, and for partially well-formed queries to evaluate to partially accurate results. In contrast to EM, F1 and Accuracy give partial credit and thus obtain higher scores.
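To make the contrast concrete, here is a toy sketch of the two kinds of metric; the exact definitions used for SPICE may differ in normalization details:

```python
def exact_match(pred, gold):
    """Strict equality on whitespace-normalized SPARQL strings (illustrative)."""
    return " ".join(pred.split()) == " ".join(gold.split())

def set_f1(pred_answers, gold_answers):
    """F1 over answer sets, granting partial credit unlike exact match."""
    pred, gold = set(pred_answers), set(gold_answers)
    if not pred or not gold:
        return float(pred == gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)
```

A parse that retrieves half the correct answers scores 0 under EM but can still earn a non-zero F1, which is why the EM column in the tables is systematically lower.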
The Effect of Entity Disambiguation We now present results with a variant of our model which operates over disambiguated entities (compare the second and fifth blocks in Table 2, with heading DCG_CL). We observe that disambiguation has a significant effect on model performance, leading to an F1 increase of more than 11%. We further assess the utility of context-dependent linking by comparing DCG_CL to a variant which does not have access to the type graph G^type_t, neither during training nor during evaluation (see column DCG, With Disambiguation). This type-deficient model performs worse overall both in terms of F1 and EM, but is still superior to BertSP_GL, even though the latter has access to more information via the top-K entity lookup and global type linking. This points to the importance of encoding context in a targeted manner rather than by brute force. In Appendix A we discuss the effect of context length on the availability of type information.

The Effect of Conversation Length We next examine the benefits of modeling context dynamically. Ideally, a model should produce an accurate semantic parse no matter the conversation length. Figure 3 plots exact match accuracy (averaged across question types) against different turn positions. In general, we observe that utterances occurring later in the conversation are more difficult to parse. As the dialogue progresses, subsequent turns become more challenging for the model, which is expected to leverage the common ground established so far. This involves maintaining the subgraph context based on the conversation history in addition to handling linguistic phenomena such as coreference and ellipsis. Overall, DCG_CL (with and without entity disambiguation) is superior to BertSP_GL, and the gap between the two models is more pronounced for later turns.

Ellipsis occurs when previously mentioned information is omitted; for instance, the phrase do for a living is elided from (Q2) but can be understood in the context of (Q1). Coreference, on the other hand, occurs between utterances that refer to the same entity, for example, between utterances 2 and 3 in Figure 1.
In Table 3, using exact match, we assess how different models handle ellipsis and coreference across question types. Coref = −1 refers to cases where coreference can be resolved in the immediate context, i.e., the previous turn. Coref < −1 involves utterances that require access to wider conversation context, beyond the previous turn. In the setting that does not disambiguate entities, we observe that models which exploit discourse context (variants DCG_GL and DCG_CL) are better at resolving coreferring and elliptical expressions compared to BertSP_GL. We also see that entity disambiguation is very helpful, leading to substantial improvements for DCG_CL across discourse-related phenomena.
Similar to Perez-Beltrachini et al. (2023), we also evaluate model performance on utterances with plural mentions; these are typically linked to multiple entities which the semantic parser must enumerate in order to build a correct parse (M-Entities in Table 3). DCG_CL with disambiguation is overall best, while DCG (without type linking) is worse. This is not surprising: utterances with multiple entities generally have complex parses, with multiple sub-queries and entity types, which DCG does not have access to.
The Nature of Parsing Errors Overall, we find that our model is able to predict syntactically valid SPARQL queries. Errors are mostly due to misinterpretations of the question's intent given the graph context and previous questions, or to missing information. Our model also has difficulty parsing Clarification and Quantitative Reasoning questions. For Clarification questions, it is not able to select the right entity after clarification. Failures in type linking are a major cause of errors for Quantitative Reasoning questions, which typically have no or very limited context (e.g., "Which railway stations were managed by exactly 1 social group?"). Our model relies on the availability of types in the entity neighborhood, as it performs type linking in a context-dependent manner. We observe that it becomes better at parsing such questions when given access to all KG types (see Table 2, DCG_GL vs. DCG_CL).

Conclusions
In this paper, we present a semantic parser for KBQA which interactively maps user utterances into executable logical forms, in the context of previous utterances. Our model represents information about utterances and their context as KG subgraphs which are created dynamically and encoded using a graph neural network. We further propose a context-dependent approach to type linking which is efficient and scalable.
Our experiments reveal that better modeling of contextual information improves performance, in terms of entity and type linking, resolving coreference and ellipsis, and keeping track of the interaction history as the conversation evolves. Directions for future work are many and varied. In our experiments we use an off-the-shelf NER system; however, jointly learning a semantic parser and an entity linker would be mutually beneficial, avoiding error propagation. Given that it is prohibitive to encode the entire KG, we encode relevant subgraphs on the fly. We could further explicitly model the relationship between KG entities and question tokens, which previous work has shown to be promising (Wang et al., 2020). Finally, it would be interesting to adapt our model so as to handle non-i.i.d. generalization settings.

Limitations
Our model relies on a pre-trained NER module for entity linking. As this module is trained and evaluated on specific datasets, its performance may not generalize to unseen domains within Wikidata. Moreover, we did not explicitly consider relations. We assume that the correct information will be available, which may not always be the case. We focus on encoding KG structural information and pass the learning of the interactions between the KG and the linguistic utterances to the decoder. As shown in previous work (Zhang et al., 2022), effectively combining KG information with a language model can be mutually beneficial in the context of question answering. However, it requires an extensive study in itself to determine the task-specific parametrization (Wang et al., 2022).

A The Effect of Context Length
Figure 4 plots the performance of DCG_CL (disambiguation setting) against progressively increasing context length ∈ [1, 10]. We observe that access to wider context is beneficial up to a point: performance deteriorates with very long contexts (beyond turn position 5). We conjecture two reasons for this. Firstly, interactions might be long because users ask about more than one entity or topic, in which case local context might be sufficient to provide an answer. Secondly, longer interactions might be genuinely confusing and noisy for annotators to create, let alone models. We further assess how context length interacts with the availability of type information. Table 4 shows the difference in performance with and without explicit type linking at context lengths 1 and 5. As described in Section 5, DCG does not have explicit type linking while DCG_CL uses context-dependent linking; both models apply entity disambiguation. ∆F1_1 is the absolute difference in F1 score between DCG_CL and DCG for context length 1; similarly, ∆F1_5 denotes the difference for context length 5.
Overall, we find a significant drop in performance for context length 1 compared to context length 5. This indicates that more type information becomes available with increased context length. However, performance varies with question type. Specifically, the exact match difference is much larger for Clarification questions compared to Quantitative Reasoning questions, which seem to require access to larger KB subgraphs.

B Model Details
Our model is implemented in PyTorch (Paszke et al., 2019) and trained with the AdamW (Loshchilov and Hutter, 2019) optimizer. It was trained on an A100 GPU with a batch size of 64 and an initial learning rate of 0.001. AdamW coefficients β_1 and β_2 (used for computing running averages of the gradient and its square) were set to 0.9 and 0.999, respectively. The weight decay coefficient was set to 0.01 for all experiments. Hyperparameters were set based on initial experiments using a manually selected grid. We did not tune learning rate parameters. We chose the number of GATv2 and decoder layers from [1, 4] and found 2 to work best. Our SPARQL query server was deployed using Blazegraph (https://blazegraph.com/), which uses only CPU-based resources and has access to 100G of RAM.
We use two attention heads with GATv2. Specifically, let K denote the number of attention heads, computed as in Velickovic et al. (2018) and Equation (3). The output of each head is concatenated as follows:

h'_i = ∥_{k=1}^{K} σ( Σ_{j∈N_i} α^k_ij W^k h_j )

where ∥ represents concatenation and α^k_ij are the normalized attention coefficients computed by the k-th attention mechanism as in Equation (1).
Our graph is represented as an adjacency matrix. To create a mini-batch, adjacency matrices are diagonally stacked (Fey and Lenssen, 2019). This creates a combined graph that holds multiple isolated subgraphs:

A = diag(A_1, A_2, . . ., A_n)

where n is the batch size, i.e., the number of graphs. Node input and target features are simply concatenated in the node dimension:

H = [H_1; H_2; . . .; H_n]
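The diagonal stacking can be sketched with nested lists (PyTorch Geometric performs the equivalent operation on sparse edge indices rather than dense matrices):

```python
def block_diagonal(adjs):
    """Diagonally stack square adjacency matrices into one batched graph.

    Each input is a square 0/1 matrix (nested lists); the result contains
    the original graphs as isolated components, since no edge crosses a
    block boundary.
    """
    total = sum(len(a) for a in adjs)
    out = [[0] * total for _ in range(total)]
    offset = 0
    for a in adjs:
        n = len(a)
        for i in range(n):
            for j in range(n):
                out[offset + i][offset + j] = a[i][j]
        offset += n
    return out
```

Because the stacked components are disconnected, message passing inside the batched graph never leaks information between examples.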

Figure 2: Model architecture. The previous and current utterance are concatenated, and their subgraphs are merged and encoded with a graph neural network. The subgraphs represent the entity neighborhood and type linking.
Figure 3: Exact match accuracy averaged across question types at different turn positions. +/−D denotes the presence/absence of entity disambiguation.
Example encoder input (Figure 2): Who starred in Mathias Kneissl? [SEP] Rainer Werner Fassbinder ... [SEP] Who was the director of that work of art?
W_s is a learned projection and |V_s| is the SPARQL syntax vocabulary size. We calculate p_G using node embeddings H^l_t, as p_G = softmax(H^l_t s_i).

Training Our model is trained end-to-end by optimizing the cross-entropy loss. Given training instance ⟨C_t, X_t, Y_t, G_t⟩, where Y_t is a sequence of gold output tokens ⟨y_1, y_2, . . ., y_|Y_t|⟩, we minimize the token-level cross-entropy:

L = − Σ_{i=1}^{|Y_t|} log p(y_i | y_{<i}, H^l_t, Z_t)

As can be seen in Table 1, SPICE contains a large number of training instances, the conversations are relatively long (the average turn length is 9.5), and the underlying KG is sizeable (12.8M entities). SPICE has simple factual questions but also more complicated ones requiring reasoning over sets of triples; it also exemplifies various discourse-related phenomena such as coreference and ellipsis. We provide examples of the types of questions attested in SPICE in Appendix C.

Table 1: Statistics of the SPICE dataset.

Table 2: Results on the SPICE dataset (test set). BertSP_GL (Perez-Beltrachini et al., 2023) uses NER based on AllenNLP and a global lookup (subscript GL) for type linking. DCG_CL uses context-dependent type linking (subscript CL) and also AllenNLP. DCG has no type linking. We measure F1, Accuracy (AC), and Exact Match (EM).

We first concentrate on model variants without entity disambiguation for a fair comparison with Perez-Beltrachini et al. (2023). We compare DCG_GL, a version of our model which adopts a global lookup for type linking similar to BertSP_GL and differs only in how contextual information is encoded. As we can see, our graph-based model performs better, reaching an F1 score of 72.3% compared to 59% obtained by BertSP_GL, which is limited by the way it encodes contextual information. Recall that BertSP_GL linearizes the subgraph context, splits it into subsequences, and feeds it to the model chunk by chunk. Our model alleviates this problem by efficiently encoding the KG information with a graph neural network, preserving dependencies captured in its structure. As a result, DCG_GL performs better on most question types, including simple and reasoning-style questions. We further compare DCG_GL against a variant which uses context-dependent type linking (DCG_CL).

Table 4: Interaction of context length and type linking. ∆F1_1 is the absolute difference in F1 score between DCG_CL and DCG for context length 1; ∆F1_5 is the absolute F1 difference for context length 5.