Dialogue Meaning Representation for Task-Oriented Dialogue Systems

Dialogue meaning representation formulates natural language utterance semantics in their conversational context in an explicit and machine-readable form. Previous work typically follows the intent-slot framework, which is easy to annotate yet limited in scalability for complex linguistic expressions. A line of work alleviates the representation issue by introducing hierarchical structures, but still struggles to express complex compositional semantics such as negation and coreference. We propose Dialogue Meaning Representation (DMR), a pliable and easily extendable representation for task-oriented dialogue. Our representation contains a set of nodes and edges to represent rich compositional semantics. Moreover, we propose an inheritance hierarchy mechanism focusing on domain extensibility. Additionally, we annotate DMR-FastFood, a multi-turn dialogue dataset with more than 70k utterances, with DMR. We propose two evaluation tasks for different dialogue models, along with GNNCoref, a novel model for the graph-based coreference resolution task. Experiments show that DMR can be parsed well with pre-trained Seq2Seq models, and that GNNCoref outperforms the baseline models by a large margin.


Introduction
A task-oriented dialogue (TOD) system aims to serve users to accomplish tasks in specific domains through interactive conversations. Converting natural language semantics in their conversational context into a machine-readable structured representation, also known as dialogue meaning representation, is at the core of TOD system research. A meaning representation framework sets the stage for natural language understanding (NLU) and allows the system to communicate with other downstream components such as databases or web service APIs. Derived from the theoretical framework of Fillmore (1968), and widely adopted in dialogue system designs as early as Bobrow et al. (1977), the classic flat intent-slot schema maps an utterance to one specific intent with several associated slots. Such schemas are convenient for annotation but limited in expressing compositional semantics, such as conjunction, modification, negation, and coreference across dialogue turns. These complex patterns are not at all uncommon in real-world use cases. Take the fast-food domain data from the MultiDoGO dataset (Peskov et al., 2019) as an example, where an agent is required to extract information about ordering food (e.g., food name, quantity, size, and ingredients) from the conversation with the customer: about 12.3% of the utterances contain multiple intents and 11.2% involve coreference semantics. However, the flat intent-slot schema leaves these semantics uncovered. Moreover, 8.7% of the ordering utterances contain multiple objects and modifiers. For example in Figure 1, with the flat intent-slot schema, the extracted food items "coke" and "green stripe" (a pizza name) cannot be aligned with the size "large", the quantities "a" and "two", and the ingredient "extra cheese". Our code and data are available at amazon-research/dialogue-meaning-representation.

Figure 1: An example of a customer's utterance for food ordering ("Yes, I need a large coke, two green stripe added with extra cheese."). The flat intent-slot schema cannot align the food items ("coke" and "green stripe") with the modifiers ("large", "a", "two" and "extra cheese") in conjunction constructions. These multiple conjunctions and modifiers require a good meaning representation to reveal the relations between attributes (e.g., size, quantity) and their corresponding entities. Here, we propose DMR, a meaning representation for TOD which can resolve such compositional semantics, with an example shown in the lower part of the figure.
The vanilla flat intent-slot schema needs to be extended to support compositional semantics. Gupta et al. (2018) propose TOP, a hierarchical parse tree that allows the representation of nested intents. Its successor, SB-TOP (Aghajanyan et al., 2020), further simplifies the structure and supports coreference. Cheng et al. (2020) introduce TreeDST, also a tree-structured dialogue state representation, to allow high compositionality and integrate domain knowledge. These studies reflect a trend toward better expression of complex compositional semantics via structural adaptation.
This paper proposes Dialogue Meaning Representation (DMR), which significantly extends the intent-slot framework. DMR is a rooted, directed acyclic graph (DAG) composed of nodes of Intent, Entity and pre-defined Operator and Keyword, as well as edges between them. Entity is an extension of the slot, which wraps the slot value with a specific slot type defined in external knowledge. Such a design allows arbitrarily complex compositionality between slots and preserves the potential for type constraints. Operator and Keyword are components that represent linguistic knowledge (i.e., general semantics) such as conjunction, negation, quantification, and coreference. The details of DMR are described in Section 2, and an example comparing DMR with the flat intent-slot representation is shown in Figure 1. As described later in Section 2, many of the key designs are inspired by AMR (Banarescu et al., 2013) but specialized for TOD. Thus DMR can be considered a dialect of AMR. From this perspective, DMR is powerful enough and easily extendable.
Moreover, DMR is capable of adapting to different domains. Unlike previous works, DMR utilizes a domain-agnostic ontology to define the structural constraints and representations of general semantics. Chatbot developers can then derive a domain-specific ontology for their applications through the Inheritance Hierarchy mechanism. This design improves both the generalization and normalization of DMR.
To validate our idea, we propose a dataset, DMR-FastFood, with 7,194 dialogues and 70,328 annotated utterances. The dataset is extensively annotated with linguistic semantics, including 16,087 conjunctions and 557 negations, significantly more than other related datasets. We develop and evaluate several baseline models to pinpoint where the challenges lie, and further propose GNNCoref for the coreference resolution task on DMR. In general, DMR parsing is not difficult, especially with a pre-trained model, though graphs with more complex (and often deeper) structures are naturally more challenging. Moreover, experiments show that GNNCoref performs better than the baseline models.

Dialogue Meaning Representation
This section describes the structure of the DMR graph, the domain-agnostic ontology, and the representation of general semantics.

DMR Ontology
DMR ontology defines the nodes, the edges, and the rules for constructing DMR graphs. It also describes the inheritance hierarchy mechanism. DMR is a rooted directed acyclic graph with node and edge labels. Figure 2 shows an example of a DMR graph from the fast-food domain. Different from the general-purpose predicates and concepts defined in AMR, DMR utilizes nodes specially designed for TOD, and the edges are defined by the nodes they link.

Nodes
There are four types of nodes in DMR:
• Intent denotes the intention of the speaker, such as OrderIntent in Figure 2.
• Entity denotes the objects mentioned in the utterance. Generally, it has the form "<lexical_value> <canonical_value> Entity", where <lexical_value> specifies the surface-form value from the utterance, and <canonical_value> is the predefined value for the entity in the ontology. Both <lexical_value> and <canonical_value> are optional.
• Operator supports compositional constructions, such as and for conjunction and reference for cross-turn coreference. Details are described in the next section.
• Keyword specifies keywords for some special semantics, such as "-" for negation.
Each node (except keywords) is assigned an identifier, as in AMR, such as "v1" for node OrderIntent in Figure 2. The root node of a DMR graph is restricted to be either an Intent node or a conjunction operator and that packs multiple intentions.

Edges
The nodes in a DMR graph are linked with directed edges. The outgoing edges of a node are the arguments of this node. All types of nodes have predefined arguments in the ontology, but some arguments may not appear in a specific DMR graph. For example, in the fast-food domain, the intent OrderIntent has one argument order-item (see Figure 2). The entity type DrinkItem has predefined arguments quant, mod and ingredient; but in the example of Figure 2, no ingredients of the coke are mentioned in the utterance, so the edge ingredient does not appear in the graph.

Inheritance Hierarchy
DMR features an inheritance hierarchy. With this mechanism, chatbot developers can easily derive a domain-specific ontology and organize it hierarchically. For example, in the fast-food domain, we can derive an ontology that defines three intents OrderIntent, PaymentIntent and ThankYouIntent; two base entity types FoodItem and DrinkItem; and three FoodItem types Pizza, Burger and Sandwich that inherit from FoodItem.
With the inheritance hierarchy, domain-specific and domain-agnostic knowledge are well separated: general semantics common to all domains, such as conjunction, quantification, negation, and coreference, are defined in the domain-agnostic ontology, while the domain-specific part inherits these representations and can thus focus on the application. Further, it reduces the burden of constructing the ontology, as intent and entity types inherit their parents' arguments by default, and the ontology is organized hierarchically.
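The default argument inheritance described above can be sketched with plain Python classes; this is only an illustration of the mechanism, and the `all_arguments` helper is our own construct, not part of the DMR specification.

```python
class Node:
    """Base class for ontology node types; arguments are inherited."""
    arguments: tuple = ()

    @classmethod
    def all_arguments(cls):
        # Collect arguments declared on the class and all its ancestors,
        # mirroring the DMR rule that types inherit their parents' arguments.
        seen = []
        for klass in reversed(cls.__mro__):
            for arg in getattr(klass, "arguments", ()):
                if arg not in seen:
                    seen.append(arg)
        return seen

# Domain-agnostic base entity type with general arguments.
class Entity(Node):
    arguments = ("quant", "mod")

# Fast-food domain ontology derived from the domain-agnostic one.
class FoodItem(Entity):
    arguments = ("ingredient",)

class Pizza(FoodItem):
    pass  # inherits quant, mod and ingredient by default

print(Pizza.all_arguments())  # ['quant', 'mod', 'ingredient']
```

A new domain thus only declares its extra arguments; the general semantics come along for free.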

Compositional Semantics
Here we describe the general compositional semantics defined in the domain-agnostic ontology. It is worth noting that we do not cover all general semantics in TOD, though this set can be extended in the future. The examples are taken from the fast-food domain; we only show sub-graphs of the DMRs and omit the variables for simplicity.
Modification refers to the semantics where a specific adjective modifies some entity. Modifiers are non-essential descriptive content compared to arguments (Dowty, 1982, 1989, 2003), such as the size or color of an object. The modification semantics is expressed by the edge mod.

Quantification is also a common semantics in TOD. Quantification is expressed by the edge labeled quant; T3 in Figure 3 shows such a case.

Cross-Turn Coreference is a common phenomenon in dialogues. Since DMR represents semantics as a graph, coreference in DMR links corresponding nodes rather than simple text mentions. DMR introduces a special operator Reference and an edge refer to represent cross-turn coreferences. The reference node has the form "reference <lexical_value>", along with an argument refer that points to another node. For example, T7 in Figure 3 contains a reference node "v2" that points to "T:5 N:v4", which means this node refers to the node "v4" in turn T5's DMR.

Negation is the construction that ties a negative polarity to another element, reversing a state of affairs or discontinuing an act. For instance, the utterance "Please cancel the burger" conveys a cancel action on an order. Inspired by AMR, we observe that negation can be seen as a binding act attached to an element. Therefore, instead of representing negation via an additional Intent, DMR resolves negation with the edge polarity and the keyword "-". For example, in T9 in Figure 3, node "v4" is negated. One tricky issue about negation is its scope: a negation act on an order item can be confused with one on an (enclosing) order intent, leading to an unintended "overkill". At this stage of development, we make a simplification and restrict negation to attach only to Entity nodes.

Related Work
There is rising interest in developing more flexible representations for TOD beyond the flat slot representation (Bobrow et al., 1977). In this section, we briefly introduce them and compare the most related works with DMR.
AMR Using AMR for semantic parsing has been studied since its introduction (Banarescu et al., 2013). Several works apply AMR to dialogue systems. Bai et al. (2021) model dialogue state with AMR for chit-chat. Bonial et al. (2019) extend AMR for human-robot dialogues, and further formalize it as Dialogue-AMR (Bonial et al., 2020, 2021). Dialogue-AMR represents both the illocutionary force and the propositional content of the utterance. Compared to these works, DMR focuses on TOD specifically, with node types extended for TOD description. DMR is intent-centric, and only captures semantics defined in the ontology of the intents. Further, the design of the inheritance hierarchy aims at better domain generalization to support a broad range of applications.

Figure 3 (excerpt): sample customer turns, e.g., "Okey, one sandwich."; "emmm, please add salted butter and onions on it."; and T9: "Oh, sorry, I want no toppings, instead order me a soda. that's all."
Compositional Intent-Slot Some recent works focus on the compositional intent-slot framework, such as TOP (Gupta et al., 2018), SB-TOP (Aghajanyan et al., 2020) and TreeDST (Cheng et al., 2020). These formalizations are much more powerful than the flat intent-slot schema. Generally, however, they focus less on how to obtain representations for different domains and give fewer descriptions of how nodes/edges are connected. In contrast, the ontology, which contains both domain-agnostic and domain-specific parts, allows DMR to be applied and extended to different domains while maintaining the representation structure based on a formal grammar. From this point of view, DMR is designed to provide services to different businesses. This different focus of application scenarios marks the key difference between DMR and these compositional intent-slot representations.
Programs Many efforts have been devoted to exploring representations as programs (Price, 1990; Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011). Though highly expressive, they are hard to annotate, which limits the construction of large-scale dialogue datasets.
Recently, works such as SMCalFlow (Semantic Machines et al., 2020), TOC (Campagna et al., 2021) and ThingTalk (Lam et al., 2022) propose to use executable dialogue states for TOD. To this end, the representation itself is a specially designed programming language. While executing, both database operations and response generation are performed by the program at the same time. DMR keeps the dialogue state architecture and leaves the implementation of business logic to the user applications.

Data
We use the fast-food domain data from the MultiDoGO dataset proposed by Peskov et al. (2019). We annotate all the customers' utterances with the redefined ontology and call the annotated dataset DMR-FastFood. It contains 7k annotated dialogues, and each dialogue has 18.5 turns on average, much more than in other datasets. Further, there are 7k references, 16k conjunctions and 557 negations annotated. The annotation process, statistics and comparison with related datasets are described in Appendix A.

NLU with DMR
Just as the NLU tasks of the flat intent-slot representation include Intent Classification and Slot Filling, NLU under the DMR framework is to extract DMR graphs from the customer's utterances. Given a customer's utterance x_i and the dialogue context (x_0, ..., x_{i-1}), the NLU task is to predict the DMR graph g_i. In this section, we introduce the NLU tasks with DMR and the proposed models.

Tasks
Though most DMRs can be predicted by a semantic parsing model, this is not the case for turns that contain cross-turn coreferences, namely referring turns. The reference nodes in referring turns need to be resolved to link to their referent nodes - nodes that are assigned variables - in DMRs from the dialogue context. Parsing DMRs and resolving coreferences for the referring turns at the same time is not a trivial task. Thus, in this work we split NLU with DMR into two sub-tasks: DMR Parsing and Coreference Resolution.
DMR Parsing aims to parse a customer utterance into a DMR graph, without resolving the reference nodes. This semantic parsing task is similar to the NLU task in most related works, including TOP, SB-TOP, TreeDST and SMCalFlow.
Coreference Resolution resolves the reference nodes predicted by DMR parsing. Differing from traditional text-based coreference resolution, which links referring expressions to their antecedents in text, this task is defined as follows: for each reference node n_r in a referring turn's DMR g_t, predict whether n_r and a given candidate node n_c ∈ {g_0, ..., g_{t-1}} are coreferred.

Models
Our overall framework is composed of two stages: first, a Seq2Seq model parses graphs from the utterance; then, a GNN-based model resolves coreferences.

DMR Parsing Model
The conventional approach to the semantic parsing task is the Seq2Seq architecture, which takes the utterances as input and outputs the linearized tree or graph. This approach is also applied by SMCalFlow, SB-TOP, and Rongali et al. (2020). We utilize the Seq2Seq architecture for DMR parsing as well, and restrict the decoder vocabulary to get more reasonable results. The details of our Seq2Seq model are described as follows:
Input We concatenate the utterance to be parsed and the dialogue context - the customer's and agent's utterances in previous turns - to form the model input in their dialogue order. Specifically, the model takes the input sequence r_{i-c}||x_{i-c}, ..., r_j||x_j, ..., r_{i-1}||x_{i-1}, r_i||x_i, where x_i is the customer's utterance to be parsed; x_j is the customer's or agent's utterance in the dialogue context; c is the context size; and r_j is the role tag, which is either "customer:" or "agent:" for the customer's and the agent's turns, respectively.
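The input construction can be sketched as follows; the exact separator and role-tag strings are our assumptions for illustration, since the paper does not pin down the surface format.

```python
def build_input(turns, c):
    """Concatenate the last `c` context turns plus the current customer turn.

    `turns` is a list of (role, utterance) pairs in dialogue order, ending
    with the customer utterance to be parsed; `c` is the context size.
    """
    window = turns[-(c + 1):]  # c context turns + the current turn
    return " ".join(f"{role}: {utt}" for role, utt in window)

turns = [
    ("agent", "What can I get you?"),
    ("customer", "One sandwich please."),
    ("agent", "Anything else?"),
    ("customer", "Yes, a large coke."),
]
print(build_input(turns, c=1))
# agent: Anything else? customer: Yes, a large coke.
```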
Output The output sequence of turn i is the linearized form of g_i. Linearizing g_i takes three steps: 1) remove the refer edges of each reference node, since the coreference resolution model will resolve them (see Section 5.2.2); 2) remove the variables of nodes (they are assigned back in the post-processing step for resolving references); and 3) convert the graph to a bracket expression. Since the <lexical_value>s come from the utterance content, we constrain each decoding step to only generate tokens from either the schema or the utterance x_i: we mask the probabilities of non-relevant tokens to zero in the output distribution at each decoding step. The output sequence is then parsed to DMRs with a shift-reduce parser, and the nodes are assigned variables via depth-first search.
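As a toy illustration of the linearization steps, a recursive pass over a nested-dict graph might look like this; the data structure, node labels and exact bracket convention are our assumptions, not the paper's specification.

```python
def linearize(node):
    """Convert a DMR node (a dict) to a bracket expression.

    Variables are assumed to have been removed already (step 2); refer
    edges are dropped here, matching step 1.
    """
    parts = [node["label"]]
    for edge, child in node.get("edges", []):
        if edge == "refer":  # step 1: drop refer edges
            continue
        parts.append(f":{edge} {linearize(child)}")
    return "( " + " ".join(parts) + " )"

# Hypothetical DMR for "a large coke" (canonical values are illustrative).
dmr = {
    "label": "OrderIntent",
    "edges": [
        ("order-item", {
            "label": '"coke" Coke DrinkItem',
            "edges": [("mod", {"label": '"large" Large SizeEntity'})],
        }),
    ],
}
print(linearize(dmr))
# ( OrderIntent :order-item ( "coke" Coke DrinkItem :mod ( "large" Large SizeEntity ) ) )
```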
Post-processing Though we restrict the decoder vocabulary, it is not guaranteed that the predicted sequence can be parsed into a valid graph, because the sequence may be an invalid bracket expression. To tackle this issue, a rather flexible shift-reduce parser is applied to recover a valid bracket expression by adding missing brackets or removing redundant ones. If this rescue fails, the prediction is set to OutOfDomainIntent. Finally, we assign variables to the nodes in the recovered DMR graphs.
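The bracket-repair idea can be sketched as a single balancing pass; this is a simplification of the flexible shift-reduce parser described above, not its actual implementation.

```python
def repair_brackets(tokens):
    """Drop unmatched ')' tokens and append missing ')' tokens so the
    sequence becomes a balanced bracket expression."""
    depth = 0
    out = []
    for tok in tokens:
        if tok == ")":
            if depth == 0:
                continue  # redundant closing bracket: drop it
            depth -= 1
        elif tok == "(":
            depth += 1
        out.append(tok)
    out.extend([")"] * depth)  # add missing closing brackets
    return out

print(repair_brackets(["(", "OrderIntent", ":order-item", "(", "FoodItem", ")"]))
# ['(', 'OrderIntent', ':order-item', '(', 'FoodItem', ')', ')']
```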

Coreference Resolution Model
For the graph-based coreference resolution task, we propose a GNN-based model, GNNCoref. The following equations show how the model works:

G_t = BuildDialogueGraph(g_0, ..., g_t)    (1)
G'_t = GNNEncoder(G_t)    (2)
p(corefer | n_r, n_c) = Classifier(n'_r, n'_c)    (3)

First, for each referring turn t, a Dialogue Graph G_t is built. Then the dialogue graph is encoded by a GNN encoder to encode the node features; the encoded graph is denoted as G'_t. n'_r and n'_c in G'_t are the encoded node features of the reference node and the candidate node, respectively; they are fed into a binary classifier to predict whether the two nodes are coreferred. Next, we describe the details of these three modules.

Figure 4: Dialogue Graph for the GNNCoref model. The example used here is for resolving the reference nodes (red colored nodes) in T9. The black arrows are edges within DMRs; the orange arrows are edges from DMR nodes to the turn nodes; the blue arrows are inter-turn edges that link DMRs through turn nodes; and the green arrow is the edge linking resolved coreferences in the context.
Build Dialogue Graph The dialogue graph G_t is built by connecting the DMRs (g_0, ..., g_j, ..., g_t) according to their order in the dialogue. Figure 4 shows the dialogue graph for resolving references in T9 of Figure 3. Specifically, we first build a Turn Graph for each turn from its DMR graph structure, and link each node to a turn-level global node, which we call the turn node, with an edge labeled turn-edge. The turn graphs are shown in the dashed boxes in the figure. Then each turn node (e.g., for g_j) points to its k-hop ancestor (g_{j-k}) with an edge labeled k-hop. These inter-turn edges connect the DMRs into one connected dialogue graph. Finally, if there are reference nodes already resolved in the dialogue context, e.g. the reference node in T7 in the figure, they are connected to their referent nodes with the edge refer. In this way, the coreference resolution for the current referring turn depends on the previously resolved references, which brings more information to the task. Additionally, we add an inverse edge for each edge to allow messages to pass bidirectionally. In dialogue graphs, every two turn graphs are linked through the turn nodes. Since a turn node links to all the nodes in its turn, every node in the dialogue graph can reach every other node within three message-passing steps, so all the context information can be encoded into the nodes.
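The construction steps above can be sketched as building a typed edge list (a minimal stand-in for the actual graph library objects; node ids and `max_hop` are illustrative assumptions):

```python
def build_dialogue_graph(dmr_edges, resolved_refs, max_hop=2):
    """Build the dialogue graph as (src, edge_type, dst) triples.

    dmr_edges: one list per turn of within-DMR edges (src, label, dst),
    with node ids unique across turns. resolved_refs maps already
    resolved reference nodes to their referent nodes.
    """
    edges = []
    turn_nodes = [f"turn-{t}" for t in range(len(dmr_edges))]
    for t, turn in enumerate(dmr_edges):
        edges.extend(turn)  # edges within the turn's DMR
        nodes = {n for s, _, d in turn for n in (s, d)}
        edges += [(n, "turn-edge", turn_nodes[t]) for n in sorted(nodes)]
        for k in range(1, max_hop + 1):  # inter-turn k-hop edges
            if t - k >= 0:
                edges.append((turn_nodes[t], f"{k}-hop", turn_nodes[t - k]))
    # refer edges for coreferences already resolved in the context
    edges += [(ref, "refer", ant) for ref, ant in resolved_refs.items()]
    # inverse edges for bidirectional message passing
    edges += [(d, f"inv-{lbl}", s) for s, lbl, d in edges]
    return edges

dmr_edges = [
    [("v1", "order-item", "v2")],  # g_0
    [("v3", "order-item", "v4")],  # g_1; v4 is a resolved reference node
]
graph = build_dialogue_graph(dmr_edges, resolved_refs={"v4": "v2"})
```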

GNN Encoder
The graph is encoded with a 3-layer Relational Graph Convolutional Network (R-GCN) (Schlichtkrull et al., 2018) to encode the edge information into the nodes. The GNN encoder enhances message passing among nodes and edges so that global context information is captured in this process.
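The relation-aware aggregation at the heart of an R-GCN layer can be illustrated with scalar node features; this is a toy version of the idea (real layers, such as the DGL implementation used in the paper, apply a weight matrix per relation rather than a scalar).

```python
def rgcn_layer(feats, edges, rel_weights, self_weight=1.0):
    """One simplified R-GCN step on scalar features.

    feats: {node: value}; edges: (src, rel, dst) triples;
    rel_weights: per-relation scalar weight (a stand-in for W_r).
    Messages flow src -> dst and are mean-normalized per relation.
    """
    out = {n: self_weight * v for n, v in feats.items()}  # self-loop term
    incoming = {}
    for s, r, d in edges:
        incoming.setdefault((d, r), []).append(feats[s])
    for (d, r), msgs in incoming.items():
        out[d] += rel_weights[r] * sum(msgs) / len(msgs)  # mean over N_r(d)
    return out

feats = {"a": 1.0, "b": 2.0, "c": 3.0}
edges = [("a", "turn-edge", "c"), ("b", "turn-edge", "c")]
out = rgcn_layer(feats, edges, rel_weights={"turn-edge": 0.5})
```

Stacking three such layers lets information travel three hops, which is exactly why the dialogue graph's turn nodes make the whole context reachable for every node.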
Classifier The binary classifier is a Multilayer Perceptron (MLP) with a Sigmoid activation for output. In the inference stage, we set a threshold β to determine the predictions. In our experiments, the value of β is tuned on the development set (see Appendix B.2 for details).
For a given reference node n_r, treating all the nodes in the context as its candidates is unwise, because n_r has the same entity type as its referents; however, the reference nodes are not labeled with types. According to our annotation guideline for the DMR-FastFood dataset described in Appendix A.2, all reference nodes have the same incoming edge as their referents, thus we choose the nodes in the context with the same incoming edge (or that share an incoming edge, if there is more than one) as n_r to be its candidates.
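This candidate selection amounts to filtering context nodes by incoming edge label, as in the following sketch (the edge-list representation is our own):

```python
def candidates(ref_labels, context_edges):
    """Return context nodes that share an incoming edge label with the
    reference node, whose incoming labels are given as `ref_labels`."""
    incoming = {}
    for src, label, dst in context_edges:
        incoming.setdefault(dst, set()).add(label)
    return sorted(
        n for n, labels in incoming.items() if labels & set(ref_labels)
    )

context_edges = [
    ("v1", "order-item", "v2"),
    ("v1", "order-item", "v3"),
    ("v2", "mod", "v5"),
]
print(candidates({"order-item"}, context_edges))  # ['v2', 'v3']
```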

Experiments
We report experiments for DMR Parsing and Coreference Resolution separately, and then combined results on the complete DMR graphs. Further, we analyze the key factors that affect model performance on the two tasks. The baseline models and hyperparameters used are listed in Appendix B.

DMR Parsing
We use Exact Match accuracy between predicted DMRs and the ground truths to evaluate the DMR Parsing results. To match the graphs semantically, we utilize the Smatch metric designed for AMRs: two DMRs are exactly matched if their Smatch score equals 1. Table 1a shows the DMR Parsing results. The best results are achieved by the BART-base model, which is more than 10 points above the other two models, showing that a well-pretrained Seq2Seq model is essential for this task. We use error analysis to explore the difficulties of and room for improvement on the DMR Parsing task; the details are listed in Appendix C. The main conclusions are intuitive: 1) DMR Parsing is dependent on the dialogue context, and 2) longer utterances and deeper, larger DMR graphs make the parsing task harder.

Coreference Resolution
To show the effectiveness of the proposed coreference resolution model, we compare the results with a heuristic rule-based method and an MLP baseline. The rule-based method selects the last DMR graph in the context and takes the candidate nodes in this DMR as the predicted referents; this distance-based heuristic is commonly used as an important feature in coreference resolution (Bengtson and Roth, 2008). In the MLP model, the feature of a reference or candidate node is the average of the word embeddings of its one-hop neighbors in the DMR graph and itself, and the features of the reference node and candidate node are concatenated as input to a 2-layer MLP classifier.

For the GNNCoref model, the initial node features for entity nodes and reference nodes are the average of GloVe6B-100d embeddings (Pennington et al., 2014) of all tokens (except the variable) in the node. Other nodes are symbols defined in the DMR-FastFood ontology, and their embeddings are randomly initialized. In our experiments, we use the DGL (Wang et al., 2019) implementation of R-GCN. All the methods are trained and evaluated on ground-truth DMRs.

We measure coreference resolution with accuracy, i.e., a reference node is resolved correctly if the predicted turns and nodes are the same as the ground truth. Note that about 31.2% of the reference nodes in the DMR-FastFood dataset have only one candidate, which is directly their referent; we ignore these cases during training and evaluation. The results are listed in Table 1b. We can see that the simple heuristic rule cannot handle this task well. Also, GNNCoref clearly outperforms the MLP, indicating that the global dialogue context information captured with the graph structure is very useful compared to the local one-hop features.

The Overall NLU Performance
Combining the predictions of the DMR Parsing and Coreference Resolution models, we get the complete DMR graphs. The exact match accuracy of the complete DMRs is shown in Table 1c; the coreference resolution predictions used here are from GNNCoref as reported in Table 1b. Compared to the parsing results in Table 1a, the performance drops by less than two points, which demonstrates the effectiveness of the two-step approach to this NLU task.

Ablation Study of GNNCoref model
We conduct ablation studies to investigate the effectiveness of the two key designs in the dialogue graph for GNNCoref: 1) the global Turn Nodes connecting DMRs at the turn level, and 2) the dependence on resolved coreferences in the context, added via refer edges as described in Section 5.2.2. We remove the turn nodes by connecting the DMRs through their root nodes instead, and remove the dependence on resolved coreferences by removing the refer edges.
To more rigorously demonstrate the importance of dialogue context information, we further conduct experiments with fewer R-GCN layers for GNNCoref. The results are shown in the last two rows of the table. With a 2-layer R-GCN, nodes can only see information within their own turns, so no dialogue context information is captured; a 1-layer R-GCN only captures one-hop information for each node. We can see that performance declines, indicating that less captured context leads to lower performance.

Conclusion
We focus on representations with the ability to express both complex compositional semantics and task-oriented semantics. Previous works are either limited to the intent-slot schema, and thus cannot handle complicated semantics, or short on task-oriented semantics (e.g., AMR). To handle these issues, we propose DMR, a representation extended from AMR that is capable of complex linguistic constructions with high transferability across domains. We define an ontology focusing on four kinds of nodes and three kinds of edges. Moreover, we design the inheritance hierarchy, which allows reusing, extending and inheriting node types to let DMR scale to different domains easily. We conduct experiments on the DMR parsing and coreference resolution tasks. Experimental results show that pre-trained Seq2Seq models improve DMR parsing results. We also propose a graph-based model for the coreference resolution task. Additionally, we release a large dataset to incentivize research on semantic parsing, which contains more than 70k utterances annotated with rich linguistic semantics.

Limitations
In this work, we propose DMR and a dialogue dataset in the fast-food domain annotated with it. We describe the limitations in detail as follows: (1) Dataset coverage. DMR is designed to support a broad range of domains and applications for task-oriented dialogue. However, due to limited human resources, and the observation that the fast-food domain data already contains enough compositional semantics to begin with (such as conjunction, modification, and negation), our dataset is annotated only on the fast-food domain for now.
(2) Dialogue state. We apply DMR for the NLU task in this work. How to represent dialogue states with DMRs is not addressed here.
(3) Annotation efforts. Annotating data with DMRs is more expensive than annotating with intents and slots. Few-shot learning of DMR may be a promising research problem, yet we do not cover it in this work.

A Data Annotation and Statistics
The data annotation process has two stages: 1) DMR graph annotation and 2) reference annotation.

A.1 DMR Annotation
In this stage, the annotators draw DMR graphs for each customer turn. They also annotate the referred turn number for the reference nodes. We developed an annotation tool based on GoJS for quickly drawing DMR graphs. As shown in Figure 5a, the right part of the tool shows the utterances of the dialogue up to the current turn, and the left part is the area for drawing graphs.
Before the annotation, the annotators followed a detailed guideline and went through a training process. They draw the DMR graph in the diagram by adding nodes and linking them together. We constrain the graph drawing process to follow the schema, which guarantees the validity of the resulting DMR.
To ensure the annotators do not hallucinate node values, they must either select nodes at the bottom of the tool, or copy tokens from the current utterance (tokenized by spaCy) to fill the node. There are also sanity checks before saving the annotations to the database. After the graph annotation, we assign variables to the nodes in each DMR.

A.2 Reference Annotation
In this stage, the annotators are given the current turn's DMR and the referred turn's DMR, and they need to annotate the referents for each reference node. The tool is modified as shown in Figure 5b: the left part has two diagrams. The lower diagram shows the DMR graph for the current turn, which contains the reference nodes. When a reference node is clicked, the referred DMR graph appears in the upper diagram, and the annotator can select the referents in it. Further, we constrain the annotators to only select referents with the same incoming edges as the reference nodes.

A.3 Quality Control
We make some efforts to ensure the high quality of the dataset.
First, we ask the annotators to fix the typos in the utterances during the annotation process.
Second, in some dialogues, reference nodes appear in the first customer turn, mainly due to the customer ordering toppings but no food items. We remove these dialogues from the data.
The third is double annotation. Though the annotators are well-trained experts, we have 10% of the dialogues double annotated. We compute Fleiss' kappa (Fleiss, 1971) for measuring the Inter-Annotator Agreement (IAA). After cleaning the annotation, 5,159 utterances have valid double annotations, and the IAA is 0.748, which is a substantial agreement.

A.4 Data Statistics
The statistics of DMR-FastFood are listed in Table 3; we split the dataset following the original setting. We mark and exclude the following utterances: first, we omit utterance-annotation pairs in the train set that also occur in the dev or test set, since including them would cause information leakage; second, a portion of the data annotated with a single intent is excluded from the dev and test sets, since such cases are more like text classification and trivial to get right. The remaining data is used for NLU. In Table 3, "Utterance for NLU", "NLU DMR Depth" and "NLU DMR Nodes" are statistics based on the utterances used for NLU.
We also compare DMR-FastFood with related open-source datasets in Table 4. Though DMR-FastFood is not the biggest dataset, it has more turns per dialogue, longer utterances, and more explicit annotations of negation and conjunction.

B.1 DMR Parsing Baseline Models and Results
The details of the DMR parsing models are as follows: BiLSTM+GloVe The encoder is a two-layer bidirectional LSTM (BiLSTM) and the decoder is a two-layer uni-directional LSTM. The word embeddings are initialized with GloVe840B-300d.

RoBERTa-base
The encoder is RoBERTa-base (Liu et al., 2019), and the decoder is a two-layer randomly initialized transformer with four attention heads and the same hidden size as the encoder.
BART-base BART (Lewis et al., 2020) is a powerful pretrained encoder-decoder model for Seq2Seq tasks. We finetune BART-base directly for DMR Parsing. Setting the context size c = 1, the performances of the DMR Parsing models are shown in Table 1a.

We can see that the performances of RoBERTa-base and BiLSTM are close, and that BART-base outperforms them by a large margin. This suggests that a well-pretrained Seq2Seq model is significantly important for the DMR parsing task.

B.2 Hyperparameters
DMR Parsing The DMR parsing models share the following hyperparameters: Adam optimizer, batch size 10, greedy search, and ten training epochs. Others are listed in Table 5.

Coreference Resolution
The hyperparameters for the GNN-based coreference resolution model are listed in Table 6. Note that the value of the threshold β is tuned on the development set during training: at each validation step, we take the values in the list [0.01, 0.02, ..., 0.09] as candidate thresholds, calculate the accuracy for each, and choose the threshold with the highest accuracy as β.
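The threshold sweep described above amounts to the following (a sketch; the actual validation loop is part of the training code):

```python
def tune_threshold(probs, labels,
                   grid=(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09)):
    """Pick the decision threshold with the highest dev-set accuracy.

    probs: classifier outputs in [0, 1]; labels: gold 0/1 labels.
    Ties are broken in favor of the smallest threshold in the grid.
    """
    def accuracy(beta):
        preds = [int(p >= beta) for p in probs]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(grid, key=accuracy)
```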

C Analysis of the DMR Parsing model
To investigate factors that affect the performance, we analyze the DMR Parsing model from four aspects: 1) the size of the dialogue context, 2) the depth of the target DMR, 3) the number of nodes in the target DMR, and 4) the length of the utterance.
Context Size For each DMR Parsing model, we vary the context size c from 0 to 3. The comparison of the results is listed in Table 7. The models achieve the best results with one or two context utterances, indicating that DMR Parsing is highly context-dependent.
The following analyses are based on the test set results reported in Table 1a.

DMR Depth
We compare the performance of different DMR parsing models at different DMR depths in Figure 6a. The performance drops for all models as the depth gets larger. Thus the DMR depth is a good indicator of the task complexity.
Node Number The more nodes in a DMR, the longer the sequence to predict. The results in Figure 6b are in line with this intuition. Moreover, the BART-base model performs much better than the other two on large DMR targets, indicating that a well-pretrained decoder is critical for long sequence generation.

Utterance Length We plot the DMR parsing results for different utterance lengths in Figure 6c. As expected, the models perform worse on longer utterances, and the BART-base model outperforms the others substantially on these challenging test cases. This may be due to the correlation between the length of the utterance and that of the target sequence: in general, the more people say, the more information is delivered. Thus this result is consistent with Figure 6b.