Schema-Guided Natural Language Generation

Neural network based approaches to data-to-text natural language generation (NLG) have gained popularity in recent years, with the goal of generating a natural language prompt that accurately realizes an input meaning representation. To facilitate the training of neural network models, researchers created large datasets of paired utterances and their meaning representations. However, the creation of such datasets is an arduous task and they mostly consist of simple meaning representations composed of slot and value tokens to be realized. These representations do not include any contextual information that an NLG system can use when trying to generalize, such as domain information and descriptions of slots and values. In this paper, we present the novel task of Schema-Guided Natural Language Generation (SG-NLG). Here, the goal is still to generate a natural language prompt, but in SG-NLG, the input MRs are paired with rich schemata providing contextual information. To generate a dataset for SG-NLG we re-purpose an existing dataset for another task: dialog state tracking, which includes a large and rich schema spanning multiple different attributes, including information about the domain, user intent, and slot descriptions. We train different state-of-the-art models for neural natural language generation on this dataset and show that in many cases, including rich schema information allows our models to produce higher quality outputs both in terms of semantics and diversity. We also conduct experiments comparing model performance on seen versus unseen domains, and present a human evaluation demonstrating high ratings for overall output quality.


Introduction
Much of the recent work on Neural Natural Language Generation (NNLG) focuses on generating a natural language string given some input content, primarily in the form of a structured Meaning Representation (MR) (Moryossef et al., 2019; Wiseman et al., 2017; Gong et al., 2019; Dušek et al., 2018; Colin et al., 2016; Wen et al., 2016; Dušek and Jurčíček, 2016; Dušek and Jurčíček, 2015; Wen et al., 2015).* Popular datasets used for MR-to-text generation are confined to limited domains, e.g., restaurants or product information, and usually consist of simple tuples of slots and values describing the content to be realized, failing to offer any information about domains or slots that might be useful to generation models (Novikova et al., 2017b; Gardent et al., 2017; Wen et al., 2015).

[Table 1: example MRs and output realizations, e.g., "The satellite eurus 65 is a laptop designed for home use with 4 gb of memory and a medium sized hard drive".]

Having only simple and limited information within these MRs has several shortcomings: model outputs are either very generic, or generators have to be trained for a narrow domain and cannot be used for new domains. Thus, some recent work has focused on different methods to improve naturalness (Zhu et al., 2019) and promote domain transfer (Tran and Nguyen, 2018; Wen et al., 2016).

* Authors contributed equally and are listed alphabetically.
MRs are not unique to the problem of language generation: tasks such as dialog state tracking (Rastogi et al., 2019), policy learning (Chen et al., 2018), and task completion (Li et al., 2017) also require the use of an MR to track context and state information relevant to the task. MRs from these more dialog-oriented tasks are often referred to as "schemata." While dialog state tracking schemata do not necessarily include descriptions (and generally only include names of intents, slots, and values, like traditional MRs), recent work has suggested that the use of descriptions may help with different language tasks, such as zero-shot and transfer learning (Bapna et al., 2017). The most recent Dialog System Technology Challenge (DSTC8) (Rastogi et al., 2019) provides such descriptions and introduces the idea of schema-guided dialog state tracking. Table 2 shows a sample schema from DSTC8. It is much richer and more contextually informative than traditional MRs. Each turn is annotated with information about the current speaker (e.g., SYSTEM, USER), dialog act (e.g., REQUEST), slots (e.g., CUISINE), values (e.g., Mexican and Italian), as well as the surface string utterance. When comparing this schema in Table 2 to the MRs from Table 1, we can see that the only part of the schema reflected in the MRs is the ACTIONS section, which explicitly describes intents, slots, and values. To our knowledge, no previous work on NNLG has attempted to generate natural language strings from schemata using this richer and more informative data. In this paper, we propose the new task of Schema-Guided Natural Language Generation, where we take a turn-level schema as input and generate a natural language string describing the required content, guided by the context information provided in the schema.
Following previous work on schema-guided language tasks, we hypothesize that descriptions in the schema will lead to better generated outputs and the possibility of zero-shot learning (Bapna et al., 2017). For example, to realize the MR REQUEST(time), domain-specific descriptions of common slots like time can help us produce better outputs, such as "What time do you want to reserve your dinner?" in the restaurant domain, and "What time do you want to see your movie?" for movies. Similarly, we note that for dialog system developers, writing domain-specific templates for all scenarios is clearly not scalable, but providing a few domain-specific descriptions for slots/intents is much more feasible. We focus on system-side turns from the DSTC8 dataset and, to allow our models to better generalize, we generate natural language templates, i.e., delexicalized surface forms, such as "Is there a specific cuisine type you enjoy, such as $cuisine1, $cuisine2, or something else?" from the example schema in Table 2. We chose to focus on the system-side turn because, when building a dialog system, developers currently need to spend a large amount of time hand-writing prompts for each possible situation. We believe that enabling a model to automatically generate these prompts would streamline the development process and make it much faster.
Our contributions in this paper are three-fold: (1) we introduce a novel task and repurpose a dataset for schema-guided NLG, (2) we present our methods to include schema descriptions in state-of-the-art NNLG models, and (3) we demonstrate how using a schema frequently leads to better quality outputs than traditional MRs. We experiment with three different NNLG models (Sequence-to-Sequence, Conditional Variational Auto-Encoders, and GPT-2 as a pretrained language model). We show that the rich schema information frequently helps improve model performance on similarity-to-reference and semantic accuracy measures across domains, and that it promotes more diverse outputs with larger vocabularies. We also present a human evaluation demonstrating the high quality of our outputs in terms of naturalness and semantic correctness.

Data
To create a rich dataset for NNLG, we repurpose the dataset used for the Schema-Guided State Tracking track of DSTC8 (Rastogi et al., 2019). We preprocess the data to create our Schema-Guided Natural Language (SG-NLG) dataset for training and evaluating our NNLG models. Since we are focused on system turns, we first drop all user turns. The second step in the preprocessing pipeline is to delexicalize each of the system utterances. The original data is annotated with the spans of the slots mentioned in each turn. We replace these mentions with the slot type plus an increasing index, prefixed by the $ sign, e.g., $cuisine1. For example, the utterance "Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?" becomes "Is there a specific cuisine type you enjoy, such as $cuisine1, $cuisine2, or something else?"
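The delexicalization step above can be sketched as follows; this is an illustrative simplification assuming DSTC8-style span annotations with character offsets (the field names `slot`, `start`, and `end` are assumptions, not the released format):

```python
def delexicalize(utterance, span_annotations):
    """Replace annotated slot spans with indexed $slot placeholders,
    numbering repeated slots left to right ($cuisine1, $cuisine2, ...)."""
    counts = {}
    pieces, prev = [], 0
    for span in sorted(span_annotations, key=lambda s: s["start"]):
        slot = span["slot"]
        counts[slot] = counts.get(slot, 0) + 1
        pieces.append(utterance[prev:span["start"]])
        pieces.append(f"${slot}{counts[slot]}")  # e.g. "$cuisine1"
        prev = span["end"]
    pieces.append(utterance[prev:])
    return "".join(pieces)
```

Running this on the cuisine example with spans over "Mexican" and "Italian" yields the delexicalized template shown above.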
The third step is to construct the MR corresponding to each system turn. We represent an MR as a triplet: one dialog act with exactly one slot and one value. Therefore, an MR that in the original DSTC8 dataset is represented as REQUEST(cuisine=[Mexican, Italian]) becomes REQUEST(cuisine=$cuisine1), REQUEST(cuisine=$cuisine2) (see Table 3). Note that the MR has been delexicalized in the same fashion as the utterance. Similarly, for MRs that do not have a value, e.g., REQUEST(city), we introduce the null value, resulting in REQUEST(city=null). We also use the null value to replace the slot in dialog acts that do not require one, e.g., BYE() becomes BYE(null=null), in order to ensure that each MR is converted to a triplet.
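The triplet construction just described can be sketched in a few lines (a minimal sketch of the stated rules; the function name is illustrative):

```python
def to_triplets(act, slot=None, values=None):
    """Convert one DSTC8-style act annotation into (act, slot, value)
    triplets, filling missing slots/values with "null"."""
    if slot is None:                      # e.g. BYE() -> BYE(null=null)
        return [(act, "null", "null")]
    if not values:                        # e.g. REQUEST(city) -> REQUEST(city=null)
        return [(act, slot, "null")]
    # One triplet per value, with indexed delexicalized placeholders.
    return [(act, slot, f"${slot}{i + 1}") for i in range(len(values))]
```

For example, `to_triplets("REQUEST", "cuisine", ["Mexican", "Italian"])` produces the two REQUEST triplets shown above.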
Once we generate template and MR pairs, we add information about the service. In DSTC8, there are multiple services within a single domain; e.g., services travel 1 and travel 2 are both part of the travel domain, but have distinct schemata. DSTC8 annotates each turn with the corresponding service, so we reuse this information. Our schema also includes user intent. Since only user turns are annotated with intent information, we use the immediately preceding user turn's intent annotation if the system turn and the user turn share the same service. If the service is not the same, we drop the intent information, i.e., we use an empty string as the intent (this only happens in 3.3% of cases).
Next, we add information extracted from the schema file of the original data. This includes the service description, slot descriptions (one description for each slot in the MR), and intent descriptions. These descriptions are very short English sentences (on average 9.8, 5.9, and 8.3 words for services, slots, and intents, respectively). Lastly, we add to each triplet a sentence describing, in plain English, the meaning of the MR. These descriptions are not directly available in DSTC8, but are procedurally generated by a set of rules (one rule for each act type; 10 in total). For example, the natural language MR for CONFIRM(city=$city1) is "Please confirm that the [city] is [$city1]." The intuition behind these natural language MRs is to provide a more semantically informative representation of the dialog acts, slots, and values. Table 4 shows the SG-NLG dataset statistics. In summary, SG-NLG is composed of nearly 4K MRs and over 140K templates. On average, every MR has 58 templates associated with it, but there is a large variance: there is one MR associated with over 1.7K templates (CONFIRM(restaurant name, city, time, party size, date)) and many MRs with only one template.
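The rule-based generation of natural language MRs can be sketched as follows; the rule strings for acts other than CONFIRM are hypothetical stand-ins, since the paper only specifies that there is one rule per act type:

```python
# One template rule per act type (the paper uses 10 in total; only
# CONFIRM's wording is given in the text, the others are assumptions).
RULES = {
    "CONFIRM": "Please confirm that the [{slot}] is [{value}].",
    "REQUEST": "Please provide the [{slot}].",
    "BYE": "The conversation has ended.",
}

def describe_mr(act, slot, value):
    """Render one (act, slot, value) triplet as a plain-English sentence."""
    return RULES[act].format(slot=slot.replace("_", " "), value=value)
```

For instance, `describe_mr("CONFIRM", "city", "$city1")` yields the example sentence given above.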

DSTC8 (ORIGINAL)
ACTIONS: ACT: REQUEST; SLOT: CUISINE; VALUES: Mexican, Italian
UTTERANCE: "Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?"

SG-NLG (OURS)
MR: [REQUEST(cuisine=$cuisine1), REQUEST(cuisine=$cuisine2)]
TEMPLATE: "Is there a specific cuisine type you enjoy, such as $cuisine1, $cuisine2, or something else?"

Table 4: SG-NLG dataset statistics.

Feature Encoding
We categorize the features from schemata into two different types. The first type is symbolic features. Symbolic features are encoded using a word embedding layer. They typically consist of single tokens, e.g., service names or dialog acts, and frequently resemble variable names (e.g., restaurant and restaurant name). The second type of features is natural language features. These features are typically sentences, e.g., service/slot descriptions or the natural language MR, that we encode using BERT (Devlin et al., 2018) to derive a single semantic embedding tensor.
To represent the full schema, we adopt a flat-encoding strategy. The first part of each schema is the MR, which we define as a sequence of dialog act, slot, and value tuples. At each timestep, we encode a three-part sequence: (1) a new act, slot, and value tuple from the MR, (2) the embeddings of all schema-level features (i.e., services, intents, and their descriptions), and (3) the embedding of the current slot description (see Figure 1). Finally, we append the encoded natural language MR.
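At the token level, the flat-encoding strategy can be sketched as below. This is an illustrative simplification: here each timestep is a list of raw feature strings, whereas in the model the symbolic tokens are mapped through an embedding layer and the natural language features through BERT; the service/intent names used in the test are hypothetical:

```python
def build_input_sequence(mr_triplets, schema_features, slot_descriptions):
    """For each (act, slot, value) tuple, emit the tuple, followed by all
    schema-level features, followed by the current slot's description."""
    sequence = []
    for act, slot, value in mr_triplets:
        step = [act, slot, value]
        step += schema_features                         # service, intent, descriptions
        step.append(slot_descriptions.get(slot, ""))    # current slot description
        sequence.append(step)
    return sequence
```

Each element of the returned sequence corresponds to one encoder timestep in Figure 1.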

Sequence-to-Sequence
Our first model is a Seq2Seq model with attention, copy, and constrained decoding (see the full model diagram in the appendix). We implement the attention from Luong et al. (2015):

α_t = softmax(align(h_t, s_t)),

where align is a function that computes the alignment score between the hidden state of the encoder, h_t, and the decoder hidden state, s_t, and the softmax is taken over encoder timesteps. The goal of this layer is to attend to the more salient input features.
The copy mechanism we add is based on pointer-generator networks (See et al., 2017). At each decoding step t we compute a generation probability p_gen:

p_gen = σ(w_h^T h*_t + w_s^T s_t + w_x^T x_t + b_ptr),

where w_h, w_s, and w_x are learnable weight vectors; h*_t is a context vector computed by combining the encoder hidden states and the attention weights; s_t is the decoder hidden state; x_t is the decoder input; and b_ptr is a bias term. The probability p_gen is then used to determine the distribution over the next generated word w:

P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ_{i: w_i = w} a^t_i.

Thus p_gen behaves like a soft switch deciding whether to generate from the vocabulary or copy from the input. The goal of the copy mechanism is to enable the generation of special symbols, such as $cuisine1, that are specific to the service.
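The final mixture distribution of the pointer-generator can be computed as follows; this sketch assumes p_gen, the vocabulary distribution, and the attention weights have already been produced by the network, and uses plain Python dictionaries rather than tensors:

```python
def copy_generate_dist(p_gen, p_vocab, attention, src_tokens, vocab):
    """Mix the vocabulary distribution with the attention-based copy
    distribution: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of
    attention mass on source positions where w appears."""
    final = {w: p_gen * p for w, p in zip(vocab, p_vocab)}
    for a, tok in zip(attention, src_tokens):
        final[tok] = final.get(tok, 0.0) + (1 - p_gen) * a
    return final
```

Note that source-only tokens such as $price-per-night1 receive probability mass purely from the copy term, which is exactly what lets the model emit service-specific placeholders it has never generated from the vocabulary.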

Conditional Variational Auto-Encoder
The Conditional Variational Auto-Encoder (CVAE) (Hu et al., 2017) is an extension of the VAE, where an additional conditioning vector c is attached to the last hidden state of the encoder, z, to form the initial hidden state of the decoder. The vector c is used to control the semantic meaning of the output to align with the desired MR. We use the encoded feature vector described in Section 3.1 as c. The model objective is the same as for the VAE: the sum of the reconstruction loss and the Kullback-Leibler divergence loss. At training time, z is the encoded input sentence. At prediction time, z is sampled from a Gaussian prior learned at training time. We also adapt the attention mechanism for CVAE by adding an additional matrix W_e to compute the alignment score:

align(h_t, s_t) = s_t^T W_e h_t,

where s_t is the decoder hidden state. For Seq2Seq/CVAE, we use constrained decoding to prune out candidate outputs with slot repetitions: we use a beam to keep track of slots that have already been generated, and set the probability of a new candidate node to zero if slots are repeated.
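The slot-repetition pruning used in constrained decoding can be sketched as follows; the representation of a beam candidate as (token history, proposed token, probability) is illustrative:

```python
def score_candidate(history, next_token, prob):
    """Zero out a beam candidate's probability if it would repeat a slot
    placeholder (tokens starting with '$') already in its history."""
    seen_slots = {tok for tok in history if tok.startswith("$")}
    if next_token.startswith("$") and next_token in seen_slots:
        return 0.0  # prune: this slot was already generated on this beam
    return prob
```

Applying this check at every expansion step guarantees that no surviving hypothesis realizes the same slot twice, which is why Seq2Seq/CVAE exhibit no repetition errors in the SER breakdown.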

Pretrained Language Model: GPT-2
We also experiment with a pretrained language model, specifically GPT-2 (Radford et al., 2019). Since GPT-2 is trained on purely natural language strings, we first combine the symbolic and natural language features into flat natural language strings, similar to previous work by Budzianowski and Vulić (2019). We fine-tune the GPT-2 model on these natural language inputs with the target template. At prediction time, given the schema tokens as input, we use our fine-tuned GPT-2 model with a language model head to generate an output sequence (until we hit an end-of-sequence token). We adopt top-k sampling at each decoding step.

[Table 5 (excerpt): example schemata and model outputs.]

Schema 1
MR: INFORM(price-per-night=$price-per-night1), NOTIFY-SUCCESS(null=null)
Slot Desc: price-per-night: "price per night for the stay"
Service: hotels-4; Service Desc: "Accommodation searching and booking portal"
Intent: ReserveHotel; Intent Desc: "Reserve rooms at a selected place for given dates."
Natural Language MR: the [price per night] is [$price-per-night1]. the request succeeded.
Ref: $price-per-night1 a night
Seq2Seq: your reservation is booked and the total cost is $price-per-night1 .
CVAE: your reservation has been made . the total cost is $price-per-night1 per night .
GPT2: your reservation was successful! the cost of the room is $price-per-night1 per night.

Schema 2
MR: OFFER(movie-name=$movie-name1), OFFER(movie-name=$movie-name2), OFFER(movie-name=$movie-name3), INFORM(count=$count1)
Slot Desc: movie-name: "name of the movie"; count: "the number of items that satisfy the user's request"
Service: media-2; Service Desc: "The widest selection and lowest prices for movie rentals"
Intent: FindMovies; Intent Desc: "Find movies to watch by genre and, optionally, director or actors"
[Remaining fields and outputs for Schema 2 not recovered.]
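The top-k decoding step can be illustrated as below. This is a toy sketch over a plain list of probabilities; in practice the filtering operates on GPT-2's next-token logits at every step of generation:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability token indices and renormalize,
    yielding the distribution that top-k sampling draws from."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}
```

Sampling from the filtered distribution instead of the full one trims the long tail of unlikely tokens, trading a little diversity for fewer degenerate continuations.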

Evaluation
For each of our three models, we generate a single output for each test instance. Table 5 shows example model outputs.

Evaluation Metrics
We focus on three distinct metric types: similarity to references, semantic accuracy, and diversity.

Similarity to references. As a measure of how closely our outputs match the corresponding test references, we use BLEU (n-gram precision with brevity penalty) (Papineni et al., 2002) and METEOR (n-gram precision and recall, with synonyms) (Lavie and Agarwal, 2007). We compute corpus-level BLEU for the full set of outputs and matching references. For METEOR, we compute per-output scores and average across all instances. We include these metrics in our evaluation primarily for completeness and supplement them with a human evaluation, since it is widely agreed that lexical overlap-based metrics are weak measures of quality (Novikova et al., 2017a; Belz and Reiter, 2006; Bangalore et al., 2000).
Semantic accuracy. We compute the slot error rate (SER) for each model output as compared to the corresponding MR, by finding the total number of deletions, repetitions, and hallucinations over the total number of slots for that instance (the lower the better). It is important to note that we only consider slots that have explicit values (e.g., MR: INFORM(date=$date1)) for our automatic SER computations. We are investigating methods to compute SER over implicit slots (e.g., MR: REQUEST(party size=null)) as future work, since it is non-trivial to compute due to the various ways an implicit slot might be expressed in a generated template (e.g., "How many people are in your party?" or "What is the size of your group?"). We also compute "slot match rate," i.e., the ratio of generated outputs that contain exactly the same explicit slots as the matching test MR.

Diversity. We measure diversity based on vocabulary size, distinct-N (the ratio of distinct n-grams over total n-grams) (Li et al., 2016), and novelty (the ratio of unique generated utterances in test versus references in train).

Table 6 compares model performance when trained using only the traditional MR versus using the full schema (better result for each model in bold).
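The SER computation over explicit slots can be sketched as follows; this is an illustrative simplification that matches placeholders of the form $slotN in the generated template:

```python
import re

def slot_error_rate(mr_slots, output):
    """SER = (deletions + repetitions + hallucinations) / number of
    explicit slots in the MR. mr_slots is a list of expected placeholders
    (e.g. ['$date1']); output is a generated template string."""
    found = re.findall(r"\$[\w-]+\d+", output)
    deletions = sum(1 for s in mr_slots if s not in found)
    repetitions = sum(max(0, found.count(s) - 1) for s in set(found))
    hallucinations = sum(1 for s in set(found) if s not in mr_slots)
    return (deletions + repetitions + hallucinations) / max(1, len(mr_slots))
```

For instance, an output that drops $time1 and repeats $date1 against the MR slots [$date1, $time1] incurs one deletion and one repetition, for an SER of 1.0.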

Traditional MR vs. Rich Schema
Model comparisons. To get a general sense of model performance, we first compare results across models. From Table 6, we see that Seq2Seq and CVAE have higher BLEU than GPT2 (for both MR and Schema), but that GPT2 has higher METEOR. This indicates that GPT2, being a pretrained model, is more frequently able to generate outputs that are semantically similar to the references but are not exact lexical matches (e.g., substituting "film" for "movie"). Similarly, GPT2 has a substantially larger vocabulary and higher diversity than both Seq2Seq and CVAE.
MR vs. Schema. Next, we compare the performance of each model when trained using MR versus Schema. For all models, we see an improvement in similarity metrics (BLEU/METEOR) when training on the full schema. Similarly, in terms of diversity, we see increases in vocabulary for all models, as well as increases in distinct-N and novelty (with the exception of Seq2Seq novelty, which drops slightly).
In terms of semantic accuracy, we see an improvement in both SER and slot match rate for both CVAE and GPT2. For Seq2Seq, however, we see that the model performs better on semantics when training on only the MR. To investigate, we look at a breakdown of the kinds of errors made. We find that Seq2Seq/CVAE only suffer from deletions, but GPT2 also produces repetitions and hallucinations (a common problem with pretrained language models); however, training using the schema reduces the number of these mistakes enough to result in an SER improvement for GPT2 (see the appendix for details). (Note that to avoid inflating the novelty metrics, we normalize our template values: e.g., "Table is reserved for $date1." is normalized to "Table is reserved for $date." for any $dateN value.)

Seen vs. Unseen Services
Next, we are interested to see how our models perform on specific services in the SG-NLG dataset. Recall that the original dataset consists of a set of services that can be grouped into domains: e.g., services restaurant 1 and restaurant 2 are both under the restaurant domain. Based on this, we segment our test set into three parts, by service: seen, or services that have been seen in training; partially-unseen, or services that are unseen in training but part of domains that have been seen; and fully-unseen, where both the service and the domain are unseen (we show distribution plots by service in the appendix).

MR vs. Schema. To better understand how the models do on average across all services, we show average BLEU/SER scores in Table 7. Once again, we compare performance between training on the MR vs. the schema. On average, we see that for the seen and fully-unseen partitions, training with the schema is better across almost all metrics (sometimes showing no differences in SER for fully-unseen). For partially-unseen, we see that CVAE performs better when training on only the MR; however, when averaging across the full test set in Table 6, we see an improvement with the schema. We see naturally higher BLEU and lower SER for seen vs. both partially-unseen and fully-unseen across all models. Surprisingly, we see higher schema BLEU for CVAE on fully-unseen as compared to partially-unseen, but we note that the fully-unseen sample size is very small (only 10 test MRs). We also note that GPT2 has high SER for the fully-unseen domain; upon inspection, we see slot hallucination from GPT2 within alarm 1, while Seq2Seq/CVAE never hallucinate.
Seen vs. Unseen. Table 8 shows model performance in terms of BLEU and SER. We sort services by how many references we have for them in test; events 1, for example, constitutes 19% of the test references. To focus our discussion here, we show only the top-3 services in terms of percentage of test references (results for all services appear in the appendix). For fully-unseen, we show the only available service (alarm 1). We show the best scores in bold and the worst scores in italic. (Note on Table 7: scores are weighted by the percentage of test references per service in each split; e.g., events 1 in seen makes up 19% of the seen test references, so its scores are weighted by that factor.)
For seen services (Table 8a), we see the highest BLEU scores for all models on rentalcars 1. We note that SER is consistently low across all models, with the worst SER for the top-3 services at 0.15 (the worst SER across all of seen is 0.23, as shown in the appendix).
For partially-unseen services (Table 8b), we see the best SER on restaurants 2 (but comparatively lower BLEU scores). The services 4 domain shows the highest BLEU scores for Seq2Seq and GPT2, with low SER. We note that flights 3 has the worst SER for all models. Upon investigation, we find slot description discrepancies: e.g., the slot origin airport name has the description "Number of the airport flying out from". This highlights how models may be highly sensitive to nuances in the schema information, warranting further analysis in the future.

Human Evaluation
To supplement our automatic metric evaluations, which show some of the benefits of schema-based generation, we conduct an annotation study to evaluate our schema-guided output quality. We randomly sample 50 MRs from our test set, and collect 3 judgments per output for each model as well as for a reference (randomly shuffled). We ask the annotators to give a binary rating for each output across 3 dimensions: grammar, naturalness, and semantics (as compared to the input MR). We also collect an "overall" rating for each template on a 1 (poor) to 5 (excellent) Likert scale.

Table 9 summarizes the results of the study. For grammar, naturalness, and semantics, we show the ratio of how frequently a given model or reference output is marked as correct over all outputs for that model. For the "overall" rating, we average the 3 ratings given by the annotators for each instance, and present an average across all MRs (out of 5).

From the table, we see that the CVAE model has the highest score in terms of both grammar and naturalness. Moreover, CVAE also achieves a score higher than the reference in terms of naturalness. A possible explanation for this behavior is that the quality of the references is subjective, and they are not always an ideal "gold standard". In terms of semantics, we see that GPT-2 has the highest ratings of all models. Most interestingly, we see that CVAE has a significantly lower semantic rating, although it is the winner on grammar and naturalness, indicating that while CVAE outputs may be fluent, they frequently do not actually express the required content (see Schema 3 in Table 5). This finding is also consistent with our SER calculations from Table 6, where we see that CVAE has the highest SER. In terms of overall score, we see that GPT-2 has the highest rating of all three models, and is most frequently comparable to the ratings for the references.
This can be attributed to its higher semantic accuracy, combined with good (even if not the highest) ratings on grammar and naturalness.

Related Work
Most work on NNLG uses a simple MR that consists of slots and value tokens that only describe information that should be realized, without including contextual information to guide the generator as we do; although some work has described how this could be useful (Walker et al., 2018). WebNLG (Colin et al., 2016) includes structured triples from Wikipedia which may constitute slightly richer MRs, but are not contextualized. Oraby et al. (2019) generate rich MRs that contain syntactic and stylistic information for generating descriptive restaurant reviews, but do not add in any contextual information that does not need to be included in the output realization. Table-to-text generation using ROTOWIRE (NBA players and stats) also includes richer information, but it is also not contextualized (Wiseman et al., 2017;Gong et al., 2019).
Other previous work has attempted to address domain transfer in NLG. Dethlefs et al. (2017) use an abstract meaning representation (AMR) as a way to share common semantic information across domains. Wen et al. (2016) use a "data counterfeiting" method to generate synthetic data from existing domains to train models on unseen domains, then fine-tune on a small set of in-domain utterances. Tran et al. (2018) also train models on a source domain dataset, then fine-tune on a small sample of target domain utterances for domain adaptation. Rather than fine-tuning models for new domains, our data-driven approach allows us to learn domain information directly from the data schema.

Conclusions
In this paper, we present the novel task of Schema-Guided NLG. We demonstrate how we are able to generate templates (i.e., delexicalized system prompts) across different domains using three stateof-the-art models, informed by a rich schema of information including intent descriptions, slot descriptions and domain information. We present our novel SG-NLG dataset, which we construct by repurposing a dataset from the dialog state tracking community.
In our evaluation, we demonstrate how training using our rich schema frequently improves the overall quality of generated prompts. This is true for different similarity metrics (up to 0.43 BLEU and 0.61 METEOR) that we recognize are weak measures of quality but, more importantly, for semantic metrics (as low as 0.18 average SER), and even for diversity (up to 2.6K bigram vocabulary). Moreover, this holds true on both seen and unseen domains in many different settings. We conduct a human evaluation as a more accurate quality assessment, and show how our outputs are rated up to 3.61 out of 5 overall (as compared to 3.97 for references). We observe that different models have different strengths: Seq2Seq and CVAE have higher BLEU reference similarity scores, while GPT2 is significantly more diverse and is scored highest overall in human evaluation.
For future work, we are interested in exploring how schema-guided NLG can be used in dialog system contexts, where only outputs that have no slot errors and high overall fluency should be selected as responses. We are also interested in improving both the semantic correctness and fluency of our model outputs by introducing improved methods for constrained decoding and language model integration. Additionally, we plan to develop more accurate automatic measures of quality, as well as more fine-grained control of domain transfer.

Events 1
Service description: "The comprehensive portal to find and reserve seats at events near you"
Slot descriptions: category: "Type of event"; time: "Time when the event is scheduled to start"

B Details of SER Errors
All of the errors made by Seq2Seq and CVAE are deletion errors (constrained decoding prevents repetitions/hallucinations). While using schema leads to more deletions in GPT2, it reduces repetitions and hallucinations, leading to better SER.

C.1 Data Distribution Plots
For the seen set in Figure 2a, we present the distribution of references both in training and test. For the unseen sets in Figure 2b, we present only test reference distribution (since there are no corresponding train references). Table 8 shows the performance of each model across all seen and partially-unseen test sets.