Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction

Existing studies on semantic parsing focus on mapping a natural-language utterance to a logical form (LF) in one turn. However, because natural language may contain ambiguity and variability, this is a difficult challenge. In this work, we investigate an interactive semantic parsing framework that explains the predicted LF step by step in natural language and enables the user to make corrections through natural-language feedback for individual steps. We focus on question answering over knowledge bases (KBQA) as an instantiation of our framework, aiming to increase the transparency of the parsing process and help the user trust the final answer. We construct INSPIRED, a crowdsourced dialogue dataset derived from the ComplexWebQuestions dataset. Our experiments show that this framework has the potential to greatly improve overall parse accuracy. Furthermore, we develop a pipeline for dialogue simulation to evaluate our framework w.r.t. a variety of state-of-the-art KBQA models without further crowdsourcing effort. The results demonstrate that our framework promises to be effective across such models.


Introduction
Semantic parsing aims to map natural language to formal meaning representations, such as λ-DCS, API calls, SQL and SPARQL queries. As observed in previous work (Liang et al., 2013; Yih et al., 2014, 2015; Talmor and Berant, 2018b; Chen et al., 2019; Lan and Jiang, 2020a; Gu et al., 2021), existing semantic parsers still face major challenges: (1) the accuracy of state-of-the-art semantic parsers is not high enough for real use, given that natural language questions can be ambiguous or highly variable with many possible paraphrases, and (2) it is hard for users to understand the parsing process and validate the results.
Our code and dataset will be available at https://github.com/molingbo/INSPIRED.
Figure 1: Example dialogue from the INSPIRED dataset. The agent turns (A_i's) illustrate our emphasis on transparency by explaining the predicted logical form step by step in natural language, along with intermediate answers, to the user for feedback. In the example, the user asks "What is the official language of the country that contains Al Sharqia Govenorate?" and later gives the correction: Not quite. Replace question 2 with "What is the official language spoken in the above-named nation?"
In response to the challenges above, recent work (Li and Jagadish, 2014; He et al., 2016; Chaurasia and Mooney, 2017; Su et al., 2018; Gur et al., 2018; Yao et al., 2019a; Elgohary et al., 2020) has started to explore interactive semantic parsing, which involves human users in the loop to provide feedback and boost system accuracy. For example, Su et al. (2018) conducted a systematic study showing that fine-grained user interaction can greatly improve the usability of natural language interfaces to Web APIs. Yao et al. (2019a) allow their semantic parser to ask users clarification questions when generating an If-Then program. More recently, Elgohary et al. (2020) crowdsourced the SPLASH dataset for correcting SQL queries using natural language feedback.
Compared with these approaches, we aim to enhance the transparency of the parsing process and increase user confidence in the answer they receive. We design an interactive framework for semantic parse correction (Figure 2) that can explain the predicted complex logical form in a step-by-step, decompositional manner and enable the user to make corrections to individual steps in natural language. In our framework, once the logical form for a given question is predicted by a base semantic parser, we decompose it into sub-logical forms (Logical Form Decomposition) and translate each sub-logical form to a natural language question (Sub-Question Generation). These sub-questions illustrate the steps of answering the question, allowing the user to see exactly how a final answer is found and either be confident that it is correct or make corrections to individual steps. Figure 1 shows an example dialogue between the user and our framework.
To demonstrate the advantages of our interactive framework, we propose an instantiation of it for the task of question answering over knowledge bases (KBQA), where interactive semantic parsing has remained largely unexplored. To develop such a framework, we construct a dialogue dataset via crowdsourcing, based on complex questions from the COMPLEXWEBQUESTIONS (CWQ) dataset (Talmor and Berant, 2018b), which is a widely used dataset for complex QA. During crowdsourcing, we provide template-based sub-questions and Turkers are asked to paraphrase them. This dataset, dubbed INSPIRED (INteractive Semantic ParsIng for CorREction with Decomposition), will facilitate further exploration of interactive semantic parsing for KBQA.
Upon the collection of INSPIRED, we study two core sub-tasks on this dialogue dataset in our framework: parse correction with natural-language feedback and sub-question generation, where the former regenerates the logical form using human feedback while the latter generates natural language sub-questions based on sub-logical forms to help users understand the answering process. We note that INSPIRED is constructed in such a way that it depends on the selected base parser; however, our interactive framework is very general and can be instantiated with different semantic parsers. Moreover, we show that we can use the sub-question generator trained on INSPIRED to simulate dialogues like Figure 1 for other parsers, which allows us to train a parse corrector for those parsers and study the promise of our interactive framework without involving more annotation effort.
Our main contributions are as follows: (1) We design a more transparent interactive semantic parsing framework that explains to the user how a complex question is answered step by step and enables them to make corrections to each step in natural language and appropriately trust the final answer. (2) We curate and release a new dialogue dataset using this framework to support more research on interactive semantic parsing for KBQA. (3) We establish several baseline models for the parse correction and sub-question generation tasks with thorough error analysis for further improvement. (4) Although INSPIRED is constructed based on a selected base parser, it helps us train models to simulate user feedback and study the promise of our interactive framework for correcting errors produced by other semantic parsers. With these contributions, we hope to inspire many interesting directions of future work, which we discuss at the end.

Overview
In this work, we seek to design an interactive semantic parsing framework to correct initial parses produced by KBQA semantic parsers via step-by-step user interaction. Formally, given a complex question, a knowledge base, a predicted parse from a base parser, and natural-language feedback from the user, the goal is to derive a corrected parse. Under this framework, we define two core tasks: (1) parse correction with natural language feedback; (2) generating a natural-language question from a sub-logical form (sub-question generation).
Figure 3: Illustration of the full process of our framework for complex question answering over KB via interactive semantic parsing. The initial question here is the same as the one in Figure 1, in which the process can be seen from the user perspective. (Figure labels: Sub-LF 1, Sub-LF 2, Sub-LF 1*, Sub-LF 2*; Sub-Q 1 + Ans 1 (Egypt); Sub-Q 2* + Ans 2* (Modern Standard Arabic).)
To construct such a framework, as shown in Figure 2, we (1) first break down the predicted parse and then translate each piece into a natural-language form that is easy for users to understand (i.e., the sub-question generation task), (2) construct an interactive platform using ParlAI (Miller et al., 2017) to include users in a dialogue, showing translated pieces along with intermediate answers and enabling them to give feedback for each piece in turn, (3) utilize human feedback at each specific turn, along with the contextual information so far, to correct the current piece of the parse, and (4) finally compile the corrected pieces together to generate a more accurate parse for the original complex question. The parse correction task refers to (3) and (4). We compare the corrected parse and the gold parse using an exact-match metric and report the average accuracy across the testing examples. In order to build models for sub-question generation and parse correction, we need to construct a dialogue dataset, which will be detailed in Section 3.
In this work, we focus on correcting errors in predicates, filtering conditions and comparatives/superlatives. As a simplifying assumption, we take gold named entities mentioned in a question as given. Specifically, we replace named entities in a logical form with special tokens such as #entityX#, where X is a number corresponding to the order in which the entity appears in the logical form. After parsing, we replace these tokens with the gold entities. We acknowledge that addressing errors caused by named entity recognition and linking in a real KBQA system is also challenging and leave it as an important piece of future work.
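The entity masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 1-based numbering of X, and the example entity identifier and predicate are all assumptions made for the example.

```python
import re

def mask_entities(logical_form, entities):
    """Replace each gold entity with #entityX#, numbering X by order of
    first appearance in the logical form (1-based: an assumption)."""
    if not entities:
        return logical_form, {}
    # Longest entities first so one entity string never clips another.
    alternatives = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in alternatives))
    token_of = {}

    def replace(match):
        entity = match.group(0)
        if entity not in token_of:
            token_of[entity] = f"#entity{len(token_of) + 1}#"
        return token_of[entity]

    masked = pattern.sub(replace, logical_form)
    # Return the token -> entity mapping needed to undo the masking.
    return masked, {token: entity for entity, token in token_of.items()}

def unmask_entities(masked_form, mapping):
    """Put the gold entities back after parsing."""
    for token, entity in mapping.items():
        masked_form = masked_form.replace(token, entity)
    return masked_form

# Hypothetical entity identifier and predicate, for illustration only.
lf = "SELECT ?x WHERE { ns:m.0abc ns:example.country.official_language ?x . }"
masked, mapping = mask_entities(lf, ["ns:m.0abc"])
print(masked)
print(unmask_entities(masked, mapping) == lf)
```

Because the parser only ever sees the placeholder tokens, entity linking errors are factored out of the correction task, which matches the simplifying assumption stated above.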

Crowdsourcing Workflow for Dialogue Construction
The INSPIRED dataset consists of over ten thousand structured dialogues between a user, who asks a complex question, and an agent, who seeks to answer it. Each dialogue turn of these interlocutors has prescribed content and Amazon Mechanical Turk (AMT) workers were employed in order to transform the content into more natural language. The dataset consists of dialogues that resemble Figure 1, but also includes templated versions of the sub-questions, corresponding predicted and gold SPARQL queries (logical forms), and answers to the sub-questions.

Data Preparation for Crowdsourcing
While the goal of each dialogue is to answer a complex question, making the process by which this is achieved clear is an essential research direction of this dataset. Each dialogue features a decomposition process by which our system has transformed the complex question into an initial parse using a base parser, broken it into sub-logical forms, retrieved answers, and presented this whole process to the user for correction. The overarching strategy of the decomposition process is to first identify the predicates that express distinct components in the logical form of the complex question, which correspond to the individual sub-questions. Typically, these components take the shape of a "triple" in the logical form, which consists of an entity, a predicate, and another entity. Logical forms in the CWQ dataset typically contain two or three of these components. Multiple predicates can group together to express one component, for example predicates connected by a CVT node, in which case the two predicates along with their two entities form one component. Within these components, there can be filters and/or restrictions, which provide additional information about the entities of the main predicate. This additional information is typically expressed directly in the sub-question. The exception is superlative-type questions, which add a final sub-question that expresses the selection of one entity from a list, for example, "Of these, which is the largest/smallest/etc.?" More details about how these logical forms are translated into sub-questions can be found in Appendix A, Sections A.2 and A.3.
Because SPARQL queries are not easily understandable for the average user, the entire decomposition process must be translated into English, as illustrated in Figure 3. To obtain natural sounding questions, as in Figure 1, we first translated the decomposed logical forms into English questions using templates. Then, we conducted a crowdsourcing task on AMT, in which the Turkers' main task was to rephrase the questions from the clunky, templated form into more concise and natural English.

Table 1: Phased Crowdsourcing Protocol
Phase I: Tutorial
1. Gives examples and explanations of the task.
2. Provides specific instructions for how to rephrase questions of different types.
Phase II: Qualification Quiz
1. Worker completes an 8-question multiple-choice quiz. Quiz questions are based on the tutorial content.
2. Worker must achieve at least 7 out of 8 to pass. They may take the quiz more than once, but there is a ten-minute wait period between attempts.
Phase III: Trial Period
1. Worker completes 10 predetermined tasks, which were chosen as representative examples for all the tasks.
2. Tasks are manually graded. If the work is overall good, the worker receives specific feedback on anything that was done incorrectly.
3. If quality is not good, the worker is eliminated.
4. Workers get paid the regular rate for each task and, upon completing the 10 tasks, receive a bonus for the time spent on the tutorial and qualification quiz.
Phase IV: Batches of Tasks
1. Worker is given access to a batch of 100 tasks, which are spot-checked for quality. A bonus is given as the worker passes each set of 100 tasks.
2. If quality is good, workers are given a second batch of 100 questions, also spot-checked.
3. Batch size increases based on worker quality and speed.
4. Worker completes up to 1800 tasks.
General Principles
1. Prompt feedback, payment, and release of new batches.
2. Provide a link to the tutorial so that it can be accessed at any time.
3. Higher than average payment.
4. Keep pool of workers small for better communication and quality control.
5. Verify that workers are native English speakers.
We collected dialogues for approximately one-third of the total questions in the CWQ dataset. This data included almost all of the test and validation sets and a subset of the training data. In order to reduce the cost of data collection, we conducted an analysis of repeated predicates and question types, ensuring that each predicate occurred at least three times in the training data when possible. (Some predicates, however, only occurred once or twice in the dataset; in those cases we included all the occurrences.) We also ensured that every dialogue that involved edit operations was included and that we had adequate coverage of each question type. Even with this cost-reduction strategy, it is still possible to simulate the dialogues that were not collected using our simulation pipeline described in Section 5.3. In total, we collected 3,492 questions from the training set. We omitted a small set of questions from the test and validation sets due to constructions that were consistently confusing to workers. These questions can be found in the supplementary documents of the dataset. As a result, we collected 3,441 dialogues from the test set and 3,441 from the validation set. The task was conducted using a ParlAI interface, which allowed us to set up a versatile dialogue interface in AMT.

Crowdsourcing Protocol
Because the crowdsourcing task for this dataset required extensive instructions and attention to detail, we designed the task carefully, with multiple stages of checkpoints to ensure quality data collection. An overview of these stages is given in Table 1.
Workers were first required to read through a tutorial that consisted of 40+ slides that explained the task, the ParlAI interface, and the intricacies of each type of sub-question they might encounter (Phase I). It also provided instructions for how to handle any errors they might encounter and how to provide feedback on dialogues.
After reading this tutorial, workers then needed to pass a multiple-choice qualification quiz to ensure that they understood the main points of the tutorial (Phase II). The quiz consisted of eight questions that focused on the most important aspects of the tutorial. In order to pass, the worker had to answer at least seven questions correctly. They were permitted multiple attempts, though there was a ten-minute wait period between attempts.
This quiz was automatically graded and when a worker passed, they immediately gained access to a set of ten dialogue tasks. These ten dialogues were the same for every worker and were hand-selected to represent the range of various types of dialogues (Phase III). These tasks were manually graded for accuracy, quality of paraphrases, and general understanding of the task.
If their work was adequate, the worker was given feedback and granted access to 100 tasks (Phase IV). From this point, their work was spot-checked and they were granted bonus payments and given access to more tasks in an ongoing fashion.
This phased strategy, while requiring more effort, proved to be effective in recruiting and retaining a small set of exemplary workers for a fairly detailed task and ensuring overall data quality.

Correction Operations
Each dialogue in the INSPIRED dataset includes a "correction" step after the decomposition process, where the agent asks the user if any corrections are needed (Figure 1, turn A1), and the user either confirms that the initial parse is correct or provides corrections (turn U2). The user's feedback content at each dialogue turn is formulated automatically. We first decompose both the parser's predicted SPARQL query and the gold SPARQL query into sub-logical forms, and compare those sub-logical forms to determine the sequence of operations needed to transform the predicted parse into the gold parse, such as inserting, deleting, or replacing a sub-question, which is the crowdsourced natural language paraphrase of a sub-logical form. In our current framework, a user must edit a sub-question as a whole using one of the operations mentioned previously and cannot do more fine-grained editing within a sub-question, which would be interesting future work. A much more detailed account of the dataset creation process can be found in Appendix A.
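The comparison of predicted and gold sub-logical forms can be sketched as a sequence alignment. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for the actual procedure, which is described in Appendix A and may differ; the function name, the "keep" label, and the placeholder sub-logical forms are assumptions for illustration.

```python
from difflib import SequenceMatcher

def edit_operations(predicted, gold):
    """Derive whole-sub-question edits (keep / replace / delete / insert)
    that turn the predicted list of sub-logical forms into the gold list."""
    ops = []
    matcher = SequenceMatcher(a=predicted, b=gold, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops += [("keep", p, p) for p in predicted[i1:i2]]
            continue
        old, new = predicted[i1:i2], gold[j1:j2]
        paired = min(len(old), len(new))
        # Pair up mismatched items as replacements; leftovers become
        # deletions (extra predicted) or insertions (missing gold).
        ops += [("replace", o, n) for o, n in zip(old, new)]
        ops += [("delete", o, None) for o in old[paired:]]
        ops += [("insert", None, n) for n in new[paired:]]
    return ops

def apply_operations(ops):
    """Compile the corrected sequence of sub-logical forms."""
    return [new for op, _, new in ops if op != "delete"]

# Placeholder sub-logical forms, for illustration only.
predicted = ["sub_lf_A", "sub_lf_B"]
gold = ["sub_lf_A", "sub_lf_C", "sub_lf_D"]
ops = edit_operations(predicted, gold)
print(ops)
print(apply_operations(ops) == gold)
```

Each non-"keep" operation corresponds to one user feedback turn, since the framework only allows a sub-question to be edited as a whole.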

Dataset Analysis
In this section, we study the characteristics of the errors made by the initial parser and conduct a thorough quality analysis of our INSPIRED dataset.

Error Characteristics of Initial Parser
It is important to note that our initial parser was purposefully not state-of-the-art, as we wanted to have a wide distribution of errors around which we could create dialogues. (See Section 5 for details about the initial parser.) Similar to other interactive semantic parsing work, we envision that the user will provide corrections to the sub-questions, though at this stage we require the user to use only the three operations of deleting, replacing, or inserting a whole sub-question. We leave more fine-grained edits, such as editing elements within a sub-question, for future work. Table 2 shows the distribution of sub-questions across the four main types of original complex question. Within each type, the distribution of edit operations per sub-question is shown. Though many of the sub-questions do not need any edits, replace is the most frequent edit operation, appearing in roughly 36.5% of each type, while insert is roughly 23.3% and delete is around 1.2%, with no action making up the remaining 39%. These distributions indicate that the parser is more likely to predict something incorrect or leave out a sub-question than to predict a sub-question that is not present in the gold parse.

Data Quality of INSPIRED
In this section, we highlight aspects of the INSPIRED dataset that contribute to its overall quality, including contextual awareness and paraphrasing characteristics.
Overall Data Quality. In the end, we collected 10,374 dialogues from 14 different Turkers. Various statistics about these questions can be seen in Table 3. Each worker completed at most 1,800 dialogues in total. In each dialogue, the Turker was required to rephrase the original complex question and each templated sub-question. Overall, we believe the quality of the data to be high for a few reasons. In the collection process, our Turkers had to read a detailed tutorial, pass two qualification tasks, and have their work spot-checked at each stage of collection. Because we kept our pool of workers small, we were able to maintain frequent communication with them throughout the collection process, giving feedback in an ongoing fashion.
Furthermore, a semi-automatic data cleaning method was employed to identify inaccurate paraphrases, which were then manually repaired. In order to ensure cycle consistency (Zhu et al., 2017), that is, to ensure that the meaning of the rephrased question reflects the meaning of the original question, we fine-tuned Hugging Face's implementation of T5 as a sequence-to-sequence model, in which the input was the Turkers' rephrased questions and the output was the corresponding templates (Wolf et al., 2020; Raffel et al., 2020). Templates were used as meaning representations for this task because T5 is a pretrained language model, and templates resemble the data on which it was trained more closely than SPARQL queries do. In order to evaluate the approach, a held-out dataset consisting of a random 5% of all the sub-questions that appear in the entire dataset was annotated for accuracy. This revealed an error rate of 4.4%, which we expect is representative of the INSPIRED dataset overall before applying the cleaning strategy. Then a blended ranking method, which utilized the edit distance scores and likelihood scores between the gold template and the model-generated template, was used to sort the pairs of paraphrases. The expectation was that this sorting method would push errors to the top of the ranking, so we manually reviewed the top-ranked 4.4% of pairs, matching the estimated error rate. In this manner we hoped to identify as many errors as possible while balancing the time and effort spent manually reviewing items.
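The blended ranking can be illustrated with a small sketch. The text does not specify how the two scores are combined, so the rank-sum blend below is an assumption, as are the function names, the token-level edit distance, and the dictionary keys; the log-likelihoods are assumed to come from the fine-tuned model.

```python
import math

def levenshtein(a, b):
    """Token-level edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def rank_positions(values, reverse):
    """Rank position of each item (0 = most suspicious)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks

def blended_ranking(pairs):
    """Sort pair indices so likely paraphrase errors come first.
    Assumed blend: sum of the edit-distance rank and the likelihood rank."""
    distances = [levenshtein(p["gold"].split(), p["generated"].split()) for p in pairs]
    likelihoods = [p["log_likelihood"] for p in pairs]
    by_distance = rank_positions(distances, reverse=True)       # large distance first
    by_likelihood = rank_positions(likelihoods, reverse=False)  # low likelihood first
    blended = [d + l for d, l in zip(by_distance, by_likelihood)]
    return sorted(range(len(pairs)), key=lambda i: blended[i])

def select_for_review(pairs, fraction=0.044):
    """Indices of the top-ranked fraction of pairs to review manually."""
    return blended_ranking(pairs)[: math.ceil(fraction * len(pairs))]
```

Here `fraction=0.044` mirrors the review budget stated above, which was set equal to the estimated error rate.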
The cleaning method described above successfully recovered 32% of the errors in the held-out dataset, which is slightly more than seven times the proportion we would expect to recover by reviewing a randomly selected subset of the same size. This method was then applied to the entire INSPIRED dataset, resulting in edits to 325 sub-questions out of 1,450 sub-questions that were manually reviewed. The index numbers for these edited questions can be found in the supplementary materials. Based on observations of the held-out data, the estimated error rate is around 3.1% after cleaning.
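The "seven times" comparison follows from simple arithmetic, sketched below. It assumes that reviewing a random 4.4% of the pairs would, in expectation, surface 4.4% of the errors; the variable names are illustrative.

```python
error_rate = 0.044         # estimated error rate before cleaning
fraction_reviewed = 0.044  # share of pairs reviewed manually
recall_ranked = 0.32       # share of errors recovered by the ranking method

# A random sample of the same size is expected to contain the same
# share of the errors as its share of the data.
recall_random = fraction_reviewed
lift = recall_ranked / recall_random

# Errors left if 32% of them are repaired.
residual_error_rate = error_rate * (1 - recall_ranked)

print(f"lift over random review: {lift:.2f}x")
print(f"residual error rate: {residual_error_rate:.2%}")
```

The residual rate computed this way is roughly 3%, in line with the reported post-cleaning estimate of around 3.1%.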
Contextual Awareness in Dialogue. In the dialogue, we provide answers to the sub-questions when possible, making the dialogue context-rich and providing the user with as much information as possible to help them understand the decomposition process of their original query.
This context-awareness can also be seen in the sub-question paraphrases. Our Turkers were encouraged to paraphrase questions in a manner that accounted for the overall context of the question, particularly with regard to named entities. For example, when a second sub-question referenced the answer of the first sub-question, we asked the Turkers to reference that entity without naming it explicitly, but also using a more specific phrase than "entity". An example of this can be seen in Figure 1, where instead of directly incorporating the answer of the first question (Egypt) into the second question, they referenced it using the phrase "the above-named nation". The goal of this strategy was to create a dataset of dialogues that are context-aware and grounded, on which generation models could be trained to mirror this behavior. By using less specific phrases than the entity names, our model was better able to generalize across examples during training. However, one can envision that in a real-use situation, it might be more natural for a user to simply use "Egypt" instead of "the above-named nation" when correcting sub-question 2. While our current framework is not able to accommodate this behavior, a simple data augmentation procedure in which referring expressions are replaced with the named entities should allow our system to accommodate this. We leave this data augmentation for future work, but plan to implement it upon conducting a study with real users.
Table 4: Examples of sub-questions in their actual context vs. a random context that utilizes the same predicate in its logical form. The sub-question was substituted for the one that used the same logical form (marked with *) in the random context when calculating ROUGE scores. Lexical overlap of the sub-question with each context is represented by bold text. Entities have been replaced with #entity# tokens in order to avoid disadvantaging the random context due to overlap in named entities.
In order to demonstrate contextual awareness, Table 5 shows the average ROUGE-1 and ROUGE-2 scores of all sub-questions in their actual contexts (the complex question and any preceding sub-questions) in comparison to the same sub-questions in a randomly assigned context that utilizes the same sub-logical form. Entities were masked with #entity# tokens to prevent the actual context from being advantaged by overlap in entity names. The higher scores for the actual context indicate that the wording of sub-questions reflects the context from which they are derived. Table 4 shows examples of sub-questions with these context comparisons.
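The context comparison can be sketched with a simplified n-gram overlap score. The exact ROUGE implementation and tokenization used for Table 5 are not stated here, so the F1 variant, whitespace tokenization, and example sentences below are assumptions for illustration.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1: n-gram overlap between two strings,
    using lowercased whitespace tokens."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative sentences with entities already masked as #entity#.
sub_question = "what is the official language spoken in the above-named nation"
actual_context = "what is the official language of the country that contains #entity#"
random_context = "who directed the movie that stars #entity#"

print(rouge_n(sub_question, actual_context) > rouge_n(sub_question, random_context))
```

Masking entities before scoring keeps shared entity names from inflating the overlap with the actual context, so any remaining gap reflects wording.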
Paraphrasing characteristics.  words) between all the templates in the INSPIRED dataset and all the rephrased versions of the sub-questions, which were calculated using GEM evaluation metrics (Gehrmann et al., 2021). These numbers indicate that the rephrased questions are much more diverse in phrasings and lexical choices. Further, the mean length of the templated questions is 17.3 words, while the mean length of the rephrased questions is 10.7 words. This, along with Table 6, demonstrates that the rephrased questions show much more diversity in their language, but are also more concise. More N-gram metrics calculated using GEM can be seen in Appendix B.
In order to better understand the methods by which Turkers rephrased templates, 100 randomly selected sub-questions were studied in terms of the lexical relationships between the template and rephrased versions. Table 7 shows the results of this analysis. "Lexical match" refers to the proportion of words in the rephrased version that also appear in the template, relative to the total number of words in the rephrased version. The percentage shown here is an average of all of those proportions. Synonymy, hypernymy, and hyponymy refer to the number of questions in the 100 selected items that contain an instance of one of these lexical relations. In this table, hypernymy refers to cases where the rephrased question contains a hypernym of a word that appears in the template, and vice versa for hyponymy. It is clear, therefore, that Turkers are using these strategies in their rephrasings of the templates, in addition to simply changing word order. On average, a bit less than half the words in a rephrased question are newly introduced by the Turker, and 56% of the time they are using synonymy, hypernymy, hyponymy, or some combination of these to rephrase the templated question.
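The "lexical match" statistic defined above can be computed as in the following sketch. Lowercasing and whitespace tokenization are assumptions, as are the function names and the example pair.

```python
def lexical_match(rephrased, template):
    """Share of words in the rephrased question that also occur in the
    template, relative to the length of the rephrased question."""
    template_words = set(template.lower().split())
    words = rephrased.lower().split()
    if not words:
        return 0.0
    return sum(word in template_words for word in words) / len(words)

def average_lexical_match(pairs):
    """Mean lexical match over (rephrased, template) pairs."""
    return sum(lexical_match(r, t) for r, t in pairs) / len(pairs)

# Illustrative pair, not taken from the dataset.
template = "what is the language spoken in that country"
rephrased = "what language is spoken there"
print(lexical_match(rephrased, template))
```

One minus this value is the share of words newly introduced by the Turker, which is the quantity summarized at the end of the paragraph above.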

Experiments
In this section, we first explore several base semantic parsers and show how we choose one as the initial parser to construct INSPIRED. Then, we conduct extensive experiments on the two core sub-tasks (i.e., sub-question generation and parse correction) under our interactive framework. Finally, in order to study the promise of our framework for other base parsers (beyond the one used to construct INSPIRED) without introducing extra crowdsourcing effort, we simulate dialogues based on our trained models for sub-question generation and parse correction. First, we explore three base semantic parsers: Transformers (Vaswani et al., 2017), BART-large (Lewis et al., 2020) and QGG (Lan and Jiang, 2020b). On the official CWQ leaderboard, QGG is the best-performing method in the line of staged query graph generation approaches. Models like NSM+h (He et al., 2021) and PullNet (Sun et al., 2019) directly output final answers without logical forms, so they cannot be applied in our framework. CBR-KBQA (Das et al., 2021) is the SOTA model on this dataset as of the submission time, but its code is not available. We therefore choose Transformers and BART-large as the other two candidate parsers. Since entities are masked in the LFs for these two Seq2Seq models, we provide QGG with gold entities extracted from gold logical forms for a fair comparison. We report their exact-match (EM) and F1 scores in Table 8. We finally select Transformers as the initial parser because it is neither state-of-the-art nor overly poor in performance. As the intention is to create a dataset that represents a wide range of errors and correction strategies for them, a "middle-of-the-road" parser is best for achieving good coverage of error types while still being of decent quality. We also explore the other two models in Table 8 through simulation experiments in Section 5.3.
In the following two sections, we explore those two core sub-tasks defined above under our framework. We present and evaluate several baseline models including Seq2Seq (Sutskever et al., 2014), Transformers (Vaswani et al., 2017), BART-base and BART-large (Lewis et al., 2020) for each task, in which we use INSPIRED for training and testing.

Parse Correction with NL Feedback
We formally define parse correction with natural language feedback at each turn: given the current sub-question q provided by the user as feedback, and different contexts including the history of sub-logical forms h_lf and sub-questions h_q from previous turns, the task is to generate a sub-logical form p for the current turn. After that, we compile the corrected sub-logical forms from different turns using correction operations to finally produce a corrected parse P for the entire question.
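One way to feed the feedback and its context to a sequence-to-sequence corrector is to serialize them into a single input string, as in the sketch below. The field labels, separator, function name, and placeholder predicates are hypothetical; the actual input format used in the experiments is not specified in this section.

```python
def build_corrector_input(feedback, history_questions=(), history_lfs=(),
                          use_questions=True, use_lfs=True):
    """Serialize the current feedback q with optional context h_q and h_lf.
    The 'feedback:' / 'prev_q' / 'prev_lf' labels are illustrative only."""
    fields = [f"feedback: {feedback}"]
    if use_questions:
        fields += [f"prev_q{i}: {q}" for i, q in enumerate(history_questions, 1)]
    if use_lfs:
        fields += [f"prev_lf{i}: {lf}" for i, lf in enumerate(history_lfs, 1)]
    return " | ".join(fields)

# Placeholder history for a second-turn correction.
h_q = ["Which country contains #entity1#?"]
h_lf = ["?c ns:example.contains #entity1# ."]
q = "What is the official language spoken in the above-named nation?"

print(build_corrector_input(q, h_q, h_lf))                 # q + h_q + h_lf
print(build_corrector_input(q, h_q, h_lf, use_lfs=False))  # q + h_q only
```

The two flags make it easy to compare settings with no context, with h_q only, with h_lf only, or with both.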
Based on the logical forms generated by the initial parser, Table 9 lists the correction performance of four baseline methods without considering any context. We report both the turn-level accuracy (the accuracy of the testing sub-logical forms) and the dialog-level accuracy (the end-to-end accuracy of the entire logical forms after correction) on our test set. We use beam search with a beam size of 10 to generate logical form sequences as candidates. Since models like BART adopt a subword tokenization scheme, the validity of predicates generated by concatenating subwords cannot always be guaranteed. We filter out logical forms with invalid predicates and exclude errors that repeat ones made by the initial parser.
Table 9: Turn-level and dialog-level accuracy of different models after incorporating feedback.
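The filtering of beam-search candidates can be sketched as follows. The sketch assumes that predicates are the whitespace-separated tokens starting with `ns:` (entities being masked as #entityX# tokens) and that a set of valid knowledge-base predicates is available; the function names and example predicates are illustrative.

```python
def extract_predicates(logical_form):
    """Assumption: with entities masked, every 'ns:' token is a predicate."""
    return [token for token in logical_form.split() if token.startswith("ns:")]

def first_valid_candidate(candidates, valid_predicates, initial_prediction=None):
    """Return the highest-ranked beam candidate whose predicates all exist
    and which does not repeat the initial parser's rejected prediction."""
    for logical_form in candidates:
        if any(p not in valid_predicates for p in extract_predicates(logical_form)):
            continue  # a predicate stitched from subwords that does not exist
        if logical_form == initial_prediction:
            continue  # the user already flagged this prediction as wrong
        return logical_form
    return None

# Illustrative predicate inventory and beam (best candidate first).
valid = {"ns:example.country.official_language", "ns:example.country.languages_spoken"}
initial = "?c ns:example.country.languages_spoken ?x ."
beam = [
    "?c ns:example.country.official_lang ?x .",      # invalid predicate
    initial,                                         # repeats the initial error
    "?c ns:example.country.official_language ?x .",
]
print(first_valid_candidate(beam, valid, initial))
```

Keeping the beam order intact means the model's own ranking still decides among the candidates that survive both checks.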
The results in Table 9 suggest that: (1) incorporating human feedback is quite helpful for correcting erroneous parses, significantly improving the parse accuracy over the 52.3 EM score of the initial parser without correction; (2) Transformer-based models perform better than the LSTM-based model; (3) using a pretrained BART-large model as the correction model achieves the best performance, 19.0 points higher than the initial parser in terms of the dialog-level EM score.
Then, using BART-large as the correction model, we further study the correction process by modeling different contexts, including the history of sub-questions h_q and sub-logical forms h_lf. We report both the correction accuracy for each turn and the end-to-end accuracy. As shown in Table 10, we find that: (1) Adding context to the correction model's input is indeed useful for further improving the correction accuracy, compared with the scenario without context. (2) As the number of turns goes up, context contributes more to the correction process, especially in the third and fourth turns. This indicates that including the full dialogue history in the input leads to the best results. Further, constrained relations or numerical operations usually appear in the later turns, and their pattern may be easier to capture once the context is considered. (3) The BART-large model with inputs that leverage the history of the sub-questions and sub-logical forms achieves the best performance, with a 21.2 increase in the dialog-level EM score compared to the initial parser.
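To make the notion of "context in the input" concrete, the sketch below shows one way the current sub-question could be concatenated with the dialogue history before being passed to a sequence-to-sequence model. The separator string, the ordering of the parts, and the function name `build_correction_input` are assumptions made for illustration; the exact input format is not specified here.

```python
def build_correction_input(sub_question, history_q=(), history_lf=(), sep=" | "):
    """Concatenate the current feedback with optional dialogue history.

    sub_question: the user's current sub-question q.
    history_q: earlier sub-questions h_q, oldest first.
    history_lf: earlier sub-logical forms h_lf, oldest first.
    sep: separator string (an assumption, chosen only for readability).
    """
    parts = list(history_q) + list(history_lf) + [sub_question]
    return sep.join(parts)
```

Calling the function with empty histories reproduces the no-context setting, so the same code path can serve every row of a context ablation.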
Error analysis. We sampled 100 erroneous predictions of BART-large under the best-performing setting in Table 10. This analysis makes clear that longer, more complicated logical forms are more likely to be mispredicted. Only 21 of the errors involved single predicates, while 54 erroneous parses occurred with CVT (Compound Value Type) predicates, which are essentially two predicates combined via CVT nodes (for example ?y in Table 11) that function as a single predicate. 13 errors occurred on restriction predicates, which co-occur with single or CVT predicates to further limit the entity type. For example, predicates of the location domain might occur with a restriction that limits that predicate to locations of the type country. The remaining 12 errors all occurred due to only partially generating a long logical form that contains filters. More details regarding restriction predicates and filters can be found in the Appendix, section A.3.1.

Sub-Question Generation
Question generation aims to automatically translate a sub-logical form into a human-readable natural question. Given a current sub-logical form p or the corresponding templated sub-question q_t, together with contextual information including the original complex question q_c, the sub-logical forms h_lf from previous steps or the history of templated sub-questions h_{q_t}, the task is to generate a natural sub-question q.

Table 12 lists the generation performance of four baseline models without considering any context. For each model, we explore two scenarios with different inputs: (1) the sub-logical form p only and (2) a concatenation of p and the corresponding templated sub-question q_t. We report both BLEU scores based on n-gram overlap and BERTScores measuring semantic similarity. The results in Table 12 suggest that: (1) Transformer-based models perform better than the LSTM-based model, and using a pretrained BART-large model as the generation model achieves the best performance. (2) Incorporating the templated sub-question into the model input further improves both BLEU and BERTScore for all baseline models, which is plausible because some tokens in q_t can be directly copied into the output question. Overall, the generated questions are high-quality and semantically similar to the human-written ones, although they might not appear very similar in surface form.
Furthermore, we use the best-performing scenario in Table 12 (i.e. a BART-large model with both p and q_t as the input) as the basic setting to explore the modeling of different contextual information. We again report both BLEU and BERTScore. As shown in Table 13, we find that (1) compared with the scenario with no context, adding context to the model's input yields slightly higher metric scores, which suggests that context can help the question generation model in a dialogue. (2) Settings whose input incorporates the complex question generally perform better than the others, since the complex question contains the semantics of the sub-question to be generated. (3) The BART-large model whose input contains both the complex question and the history of templated sub-questions achieves the best performance, which supports the effectiveness of leveraging both context and complex questions. We also tried to incorporate the history of sub-logical forms h_lf into the input, but it did not further improve the performance.

Table 13: Comparison of question generation performance when considering different contexts in the input.

Error analysis. We conducted a brief analysis of 100 randomly selected pairs of human-written and machine-generated questions that correspond to the same logical form. We first examined questions from the best-performing model according to BLEU scores and BERTScores, which used the current sub-logical form, the current templated sub-question, the complex question and the history of templated sub-questions from previous steps as context. Pairs in which the machine-generated and human-written versions exactly matched each other were excluded. This analysis revealed that only three generated questions (3%) were of perceptibly worse quality than the human-written questions, as can be seen in Table 14. Further, there were four cases in which the human-written questions contained grammatical errors, whereas the machine-generated ones did not. An analysis of all the generated questions that do not exactly match their human-written counterparts reveals that 64% of the generated questions are shorter in terms of number of words.

Table 14
Human-Written: Which of the above named people did the voice of toki?
Machine-Generated: Which of these people played the role of toki?
Error Explanation: Generated question does not specify that the role was a voice acting one.

Human-Written: What famous person has addison's disease?
Machine-Generated: who has suffer from addison's disease?
Error Explanation: Grammatical error

Human-Written: What district does that politician represent?
Machine-Generated: What district does that person represent?
Error Explanation: Generated question is slightly less specific
Because BLEU scores do not necessarily paint a full picture of model performance, we then examined the generated responses from the model that produced the lowest BLEU scores, which was the model with no context. Examining the same 100 samples as in the previous analysis, we noted 20 cases in which the best-performing model, which leveraged context, better reflected that context in its rephrasing than the model that did not leverage context. There were, however, 6 cases in which the model without context did this better, and in the remaining cases there was no discernible difference between the quality of the generations from the two models. Table 15 shows examples of each of these cases, for illustration.

Table 15: Comparison of 100 generated sub-questions from models with and without context in their inputs. The bolded text in columns 2 and 3 highlights what enhanced the quality of the generation in comparison to its counterpart. Model 1 refers to the model that used the complex question and previous templated questions as context (row 6 in Table 13) and Model 2 refers to the model that did not use context at all (row 1 in Table 13).
While these results are based on our own observations and certainly require further investigation and human annotation by people other than the authors, these preliminary results show that the generated questions can be more concise and of comparable quality.

Simulation
In this section, we demonstrate that our framework can be paired with KBQA parsers other than the one used for constructing INSPIRED, using simulated user feedback to correct their parsing errors (we simulate feedback because of the high cost of crowdsourcing). To simulate a dialogue, we automatically translate a parser's predicted logical forms into human-readable, natural questions using the sub-question generation model with the best-performing setting in Table 13, then use oracle error detection and train a generator to simulate a human user's corrections for these dialogues. This generator is a BART-large model that takes the complex question and templated sub-questions as input to generate natural language sub-questions. To correct parse errors, we use the previously trained parse correction model under the best-performing setting in Table 10.

We conduct simulation experiments on BART-large (Lewis et al., 2020) and QGG (Lan and Jiang, 2020b), which come from the two mainstream methodologies for KBQA mentioned above. We report both F1 and EM scores for BART-large before and after the correction process using the simulation pipeline. For QGG, since the generated query graphs do not take exactly the same format as SPARQL queries, we report the F1 score only. As shown in Table 16, the BART-large model achieves a 14.2 EM and 9.9 F1 score gain after correction. Meanwhile, the correction process brings about a 7.5 F1 score improvement for the QGG model. The results show that our INSPIRED dataset can help train effective sub-question generation and parse correction models, which make our framework applicable to a diverse set of KBQA parsers. Simulating user feedback makes it easy to understand the potential of a base parser under our interactive semantic parsing framework.
Moreover, we expand the simulation experiment to include multiple turns of correction. In order to simulate situations in which the parse correction model does not repair the parse correctly on a first attempt, we use the same human feedback generator to decode several of the highest scoring sequences as candidates for different attempts at correction. We evaluate this strategy after a maximum of three attempts.
Given that sequences decoded by beam search can differ only slightly from each other, producing lists of nearly identical feedback would not be helpful. We instead adopt diverse beam search (Vijayakumar et al., 2016), which decodes diverse outputs by optimizing a diversity-augmented objective. We set the beam size to 10, the number of groups to 2 and the diversity penalty to 1.5. As shown in Table 17, multiple attempts at correction can further improve parse accuracy. The F1 score reaches 80.1 after three attempts. Although we assumed the availability of gold entities, the results are still promising in comparison to the 70.0 F1 score of CBR-KBQA (Das et al., 2021), currently the SOTA model on CWQ. We expect CBR-KBQA to do even better given the advantages it has over plain Seq2Seq models. For example, its retrieve module can alleviate errors caused by sparse predicates and its revise module can help align correct predicates associated with certain entities. We envision the combination of our framework and theirs as an interesting future line of work.
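For intuition about the diversity-augmented objective, the sketch below shows the Hamming diversity term that diverse beam search applies at a single decoding step: tokens already chosen by earlier beam groups at that step are penalized in later groups. It is a simplified illustration, not a full decoder, and the function name and dictionary-based interface are hypothetical. With a beam size of 10 and 2 groups, each group would hold 5 beams.

```python
from collections import Counter


def apply_hamming_diversity(token_logprobs, earlier_group_tokens, diversity_penalty=1.5):
    """Penalize tokens already chosen by earlier beam groups at this time step.

    token_logprobs: dict mapping candidate token -> log-probability.
    earlier_group_tokens: tokens selected at the same step by previous groups.
    diversity_penalty: strength of the penalty (1.5 in the setting above).
    """
    counts = Counter(earlier_group_tokens)  # missing tokens count as 0
    return {
        token: logprob - diversity_penalty * counts[token]
        for token, logprob in token_logprobs.items()
    }
```

In this toy setting, a token that the first group already selected loses 1.5 in log-probability for the second group, so a slightly less likely but different token can win and steer the second group toward a different phrasing.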

Related Work
Conversational Semantic Parsing. Conversational semantic parsing (CSP) is the task of converting a sequence of natural language utterances into logical forms through conversational interactions in a context-dependent scenario. It has been studied in different settings including task-oriented dialogues, question answering and text-to-SQL. In task-oriented dialogue systems, datasets like MWoZ (Budzianowski et al., 2018; Eric et al., 2020) and SMCalFlow (Andreas et al., 2020) are created to help users accomplish a specific task (e.g. booking a hotel, checking the weather). [...] et al., 2019a) are constructed for conversational text-to-SQL tasks. Our work shares a similar objective with these settings, namely how to jointly represent natural language utterances while considering the multi-turn dynamics of the dialogue. However, we differ from them in that our task focuses on KBQA and aims at evaluating the extent to which models can interpret and apply human feedback on the generated initial parses for parse correction, instead of focusing on modeling conversational dependencies between questions.

[...] 2020) introduce SPLASH, a dataset for correcting semantic parsing with free-form natural language feedback. Using natural language as a medium for providing feedback gives the user control and flexibility to specify what is wrong and how it should be corrected. Compared with SPLASH, which utilizes tables, we focus on complex question answering over knowledge bases with various reasoning types. Additionally, instead of making a one-time correction to the entire generated parse, we propose to break down the parse into a sequence of sub-logical forms and enable the user to correct each sub-logical form in natural language step by step.

[...] 2020) construct the BREAK dataset and propose QDMR, which is a meaning representation where complex questions are decomposed into a sequence of simpler atomic textual steps. QDMR is an intermediate representation between natural language and logical forms, and is not executable on knowledge bases. In our work, we decompose the logical form of the complex question into sub-queries, which can be directly executed on the KB to retrieve answers. Moreover, the work described above decomposes questions to facilitate information retrieval for question answering. Instead, we utilize decomposition to edit and correct the initial parse of a complex question at a finer-grained level.
Question Generation From KB. With a growing demand for natural language interfaces for knowledge bases (KB), automatic question generation from structured query languages has attracted interest as a way to make logical forms interpretable to non-experts. Building on the development of neural networks, Serban et al. (2016) first propose an encoder-decoder framework for mapping KB fact triples into natural language questions. To generalize to unseen predicates and entity types, Elsahar et al. (2018) leverage textual contexts of triple occurrences in a natural language corpus, paired with a part-of-speech copy action mechanism to generate questions. Instead of only using a single KB triple, Kumar et al. (2019) propose a transformer-based architecture to generate complex multi-hop questions over knowledge graphs. Chen et al. (2020) apply a bidirectional Graph2Seq model for question generation from a sub-graph of the KB and target answers. In our work, one core task we focus on is sub-question generation, which automatically translates formal languages into natural questions and mitigates the need for specialized knowledge to access the KB.

Conclusion and Future Work
In this work, we propose an interactive semantic parsing framework that explains to the user how a complex question is answered step by step and enables them to make corrections to each step in natural language, thereby increasing user confidence in the final answer. We instantiate the framework with the task of KBQA, and experimentally show that our interactive framework has the potential to greatly increase the parse accuracy of the initial parser, and that contextual information is effective for further improving both the parse correction and sub-question generation tasks within the framework. Moreover, as a pilot study, we design a simulation pipeline to explore the potential of our framework for a variety of semantic parsers, without further annotation effort. The performance improvement shows that interactive semantic parsing is promising for further improving KBQA in general.
The INSPIRED dataset and the preliminary experiments described here provide a foundation for many directions of future work, the most obvious of which is improvements on the parse correction task. This could take the shape of gains in accuracy as well as improvements to the correction strategy. The simulation pipeline provided here can also be used for further experimentation.
We intend to conduct a user study in which our framework is utilized by human users to query a knowledge base in order to validate its viability for real use. At the moment, users are required to make corrections by inserting, deleting, or replacing whole sub-questions, though a useful addition would be modification of components within a given sub-question, which would require a more fine-grained approach to connect SPARQL query components to natural language. Other complementary work could include named-entity recognition and linking to handle entity errors. Lastly, an exciting expansion of this work would be the application of our general framework to other query languages such as SQL.

A Appendix: Dataset Creation Details
The creation of the INSPIRED dataset required careful selection of questions, design of a decompositional approach, and a translation strategy between logical forms and human-readable language. Furthermore, we carefully design a crowdsourcing task to gather more natural-sounding questions to enhance the quality and versatility of our framework.

A.1 Forming Dialogues from the CWQ Dataset
We utilize the COMPLEXWEBQUESTIONS 1.1 (CWQ) dataset (Talmor and Berant, 2018a,b), as this is a common dataset used for complex question answering over knowledge bases. This dataset was formed by combining questions from the WEBQUESTIONSSP dataset (Yih et al., 2016) to form multi-hop complex questions, meaning that they require more than one step to answer. Each question has an associated SPARQL query that functions as a meaning representation of the question. Table 18 shows an example of one of these complex questions, its associated SPARQL query, and its answer.

Question: What is the official language of the country that contains Al Sharqia Governorate?
SPARQL Query: <sparql-header-1> ?c ns:location.country.administrative_divisions #entity1# . ?c ns:location.country.official_language ?x .
Answer: Modern Standard Arabic

Table 18: Example question from the CWQ dataset. The entity "Al Sharqia Governorate" has been replaced with "#entity1#". Entities were delexicalized in order to increase generalizability across questions in training.
We envision that a human user will ask a complex question, and that the system will predict a SPARQL query for that question, decompose it into pieces, and translate those pieces into English to show to the user and solicit feedback. The system will then use that feedback to correct the initial parse, if necessary. Figure 1 illustrates this process.
In order to model this type of dialogue, we utilize a transformer-based sequence-to-sequence semantic parser to predict a SPARQL query for each complex question and decompose the predicted and gold queries into pieces. We then use these pieces as editable chunks that can be deleted, replaced, or inserted to transform the predicted query into the gold query. This process is the framework around which each dialogue is constructed. We translate each step from SPARQL into English to be comprehensible to a human user, thus resulting in dialogues like the one shown in Figure 1, all stemming from questions that occur in the CWQ dataset. Note that the parser used for this purpose is not state-of-the-art, as part of the goal was to have a broad coverage of error types for correction.
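The chunk-level editing described above can be illustrated with a small sketch that applies delete, replace, and insert operations to a list of sub-logical forms. The tuple-based operation format and the function name `apply_corrections` are hypothetical, and the sketch assumes at most one operation per position in the predicted list.

```python
def apply_corrections(predicted, operations):
    """Apply delete / replace / insert operations to a list of sub-logical forms.

    predicted: list of sub-logical form strings from the initial parser.
    operations: list of (op, index, new_lf) tuples, where index refers to a
        position in the ORIGINAL predicted list and new_lf is None for deletes.
    """
    result = list(predicted)  # leave the caller's list untouched
    # Process from the highest index down so lower indices stay valid.
    for op, index, new_lf in sorted(operations, key=lambda o: o[1], reverse=True):
        if op == "delete":
            del result[index]
        elif op == "replace":
            result[index] = new_lf
        elif op == "insert":
            result.insert(index, new_lf)  # placed before original item `index`
        else:
            raise ValueError(f"unknown operation: {op}")
    return result
```

Indexing every operation against the original prediction keeps the operations independent of one another, which matches a setting in which each turn of the dialogue addresses one chunk.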

A.2 Translation of Logical Forms Using Templates
As this dataset leverages SPARQL queries, we develop a strategy for representing these queries in a form that humans can more easily understand. To do so, we create a template corpus and a rule-based translation method. The corpus consists of 772 different predicates that appear in the CWQ dataset and translations of each into a basic template that conveys the content. The strategy of using templates to make content more human-friendly has a long history, both with handcrafted templates (Kukich, 1983; McKeown, 1985; McRoy et al., 2000) and with rule-based template formation (Angeli et al., 2010; Kondadadi et al., 2013). We use a blend of both approaches to create templates that represent logical forms in a way that is understandable to our Turkers.

As can be seen in Table 18, SPARQL queries contain predicates that take the form of triples with each component separated by periods, such as location.country.administrative_divisions and location.country.official_language. These triples consist of a domain (location), a type (country) that represents a class within the domain, and a property (administrative_divisions and official_language) that specifies more granular information. These predicates represent content information about the question and can appear in multiple different questions. For example, the location.country.administrative_divisions predicate maps to the template the country/countries that contain(s) <PH>, where <PH> ("placeholder") gets replaced with a specific entity.
In the parsing process, we delexicalize these specific entities in order to make questions more generalizable and reduce noise during training. For example, in the SPARQL query in Table 18, the replacement token #entity1# appears, which we replace with Al Sharqia Governorate when the template is invoked.
The remaining components of the SPARQL query specify the question typing and any additional components, which we leverage to identify the question type and transform the template into a full sentence. The components will be discussed more fully in section A.3. Thus, this particular SPARQL query translates to the following sub-questions:
1. What is/are the country/countries that contain(s) [Al Sharqia Governorate]?
ANSWER: Egypt
2. That entity is/are the country/countries whose official language is what?
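The translation of this example can be sketched with a toy template corpus. The first template is the one quoted above; the second is inferred from the second sub-question in the list. The dictionary, the function names, and the bracket convention for entities are illustrative assumptions, not the released template corpus.

```python
# Toy template corpus. The first entry is quoted in the text; the second is
# inferred from the sub-question shown above (an assumption).
TEMPLATES = {
    "location.country.administrative_divisions":
        "the country/countries that contain(s) <PH>",
    "location.country.official_language":
        "the country/countries whose official language is <PH>",
}


def first_sub_question(predicate, entity):
    """Prefix 'What is/are' and fill the placeholder with the named entity."""
    filled = TEMPLATES[predicate].replace("<PH>", f"[{entity}]")
    return f"What is/are {filled}?"


def follow_up_sub_question(predicate):
    """Prefix 'That entity is/are' and ask for the placeholder with 'what'."""
    filled = TEMPLATES[predicate].replace("<PH>", "what")
    return f"That entity is/are {filled}?"
```

For instance, `first_sub_question("location.country.administrative_divisions", "Al Sharqia Governorate")` reproduces the first sub-question in the list, and `follow_up_sub_question("location.country.official_language")` reproduces the second.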

A.3 Question Types
Each of the questions in the CWQ dataset can be categorized into one of four major types (Talmor and Berant, 2018b): composition, conjunction, comparative, and superlative. Each type can be identified by the SPARQL query and translated accordingly. Table 19 shows the translation process of the four types with examples of each. The general strategy is to append content to the beginning of the template and replace the <PH> token to form a complete question and express the appropriate question type. As seen in Table 19, this is quite straightforward for composition- and conjunction-type questions.
Composition questions are composed of two simple questions, where the answer to the first is used to form the second question. As an example, in order to answer the question What is the mascot of the team that has Nicholas S. Zeppos as its leader?, one must first answer In which organization is Nicholas S. Zeppos a leader? in order to have all the content necessary to answer What is the mascot of that organization?. To translate these question types to templated sub-questions, we simply append What is/are before the first template and insert the named entity where the <PH> token appears in the template. Then, That entity is/are is appended to the beginning of the second template and what replaces the placeholder. Note that these positions can be reversed depending on what content is provided in the question. For example, a question could be either of the two options, depending on the goal of the target question:
1. What is/are the organization whose leadership includes a person named Nicholas S. Zeppos?
2. Vanderbilt University is/are the organization whose leadership includes a person named what?
Conjunction questions follow a very similar process, though because their goal is to find the intersection of two categories, the first question returns a list of answers. To account for this, we simply append Of which to the second question before following the same set of rules as the composition questions.
Comparative questions generally have a comparative operator (<, >) and a number contained in their SPARQL query, which we translate simply to less than X or greater than X, as appropriate. Note that the comparative example in Table 19 contains a "restriction predicate", marked by the <RSTR> token. This will be discussed in section A.3.1.
Superlative questions require a slightly more complicated strategy. The first sub-question of a superlative-type question always generates a list of answer options, while the second sub-question must pair those answer options with numerical information, such as dates or integers. Then, these numbers are ordered, either from smallest to largest or vice versa, and the first is returned as the final answer. To account for this, we append These entities are to the front of the second template, to make it clear that multiple entities are involved, and return a paired list of entities and their corresponding values as an answer. Then we append a third sub-question that specifies how the values are sorted and returns a single answer.
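The pairing and sorting steps of the superlative strategy can be sketched as follows. The function name and the pair-based data layout are hypothetical, chosen only to mirror the "paired list of entities and their corresponding values" described above.

```python
def superlative_answer(entity_values, largest=True):
    """Sort (entity, value) pairs and return the top entity with the ranking.

    entity_values: list of (entity, numeric_value) pairs, i.e. the paired
        list returned by the second sub-question.
    largest: True to rank from largest to smallest, False for the reverse.
    """
    ranked = sorted(entity_values, key=lambda pair: pair[1], reverse=largest)
    return ranked[0][0], ranked
```

Returning the full ranking alongside the top entity keeps the intermediate step visible, in the same spirit as showing the user the paired list before the final answer.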

A.3.1 Logical Form Features
Within the four main types of questions (composition, conjunction, comparative, and superlative), there are a variety of features that appear. These features include filters, restriction predicates, and union predicates.
Filters restrict a list of entities by assigning numerical boundaries. An example of this can be seen in Table 19 in the comparative question's SPARQL query, starting with the word filter. This sequence limits the list of entities to those whose calling codes are larger than 590.
Restriction predicates can appear as auxiliary pieces to regular predicates and typically provide categorical information about an entity. For example, in Table 19, the comparative-type question What country is in the Caribbean with a country calling code higher than 590? has two entities in its SPARQL query, though Caribbean is the only entity that seems to appear in the original question. The two main predicates are location.location.contains and location.country.calling_code, but a third predicate, common.topic.notable_types, appears in between them. This predicate acts as a restriction upon the first main predicate; in this case #entity2# corresponds to country and restricts the locations that can appear as answers to the category of countries.
Because restriction predicates are not standalone pieces that could be translated into their own sub-questions, we develop a strategy for incorporating them into the templates of the predicates they restrict. First, we create a corpus of "mini-templates" that correspond to all the restriction predicates that could appear. Much of the time, these mini-templates simply place the entity (like country in the previous example) into parentheses, though in some cases they situate the entity in a prepositional phrase. Meanwhile, the main template corpus has tokens in place to define where the mini-template should be placed in the main template. One can see in the comparative example of Table 19 that there is an <RSTR> token in the template of the first sub-question. Every main template that can appear with a restriction predicate has this token, though the predicate need not always appear with a restriction. Consequently, if the restriction token does not get replaced, it simply gets deleted.
If the location.location.contains predicate appeared without a restriction predicate, it would simply read Caribbean is/are the location(s) containing what?
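The handling of the restriction token can be sketched in a few lines. The position of `<RSTR>` inside the example template and the parenthesized mini-template are assumptions made for illustration; only the replace-or-delete behavior is taken from the description above.

```python
import re


def fill_restriction(template, mini_template=None):
    """Replace <RSTR> with a mini-template, or delete it when none applies."""
    filled = template.replace("<RSTR>", mini_template or "")
    # Collapse any double spaces left behind by a deleted token.
    return re.sub(r"\s{2,}", " ", filled).strip()


# Hypothetical main template for location.location.contains.
TEMPLATE = "the location(s) <RSTR> containing <PH>"
print(fill_restriction(TEMPLATE))               # restriction token deleted
print(fill_restriction(TEMPLATE, "(country)"))  # restriction inserted
```

Deleting the token and normalizing whitespace in one place means the main templates never need two variants, one with and one without a restriction.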
Union predicates are a bit of a misnomer, as they are actually a group of predicates that function as though they are a single predicate, and thus correspond to a single template. In Table 20, one can see that the SPARQL query is quite long, with all of the content in bold corresponding to the first sub-question and the remainder corresponding to the second. Within this first sub-logical form, there are several predicates that are joined together by } union {. Collectively, these predicates encompass the concept of family by defining all the various relationship roles that are involved in that concept. Theoretically, we could enumerate all of these in template form, separated by or (the brother of John F. Kennedy or the mother of John F. Kennedy or the child of John F. Kennedy...), but this would be an unnecessarily complicated and verbose way of representing them. Instead, we enumerate the various types of union predicates that could appear and create a small corpus of templates that express the overall concept represented by each collection of predicates. The Turker thus sees questions with this feature in the same format as a regular question.

A.4 Crowdsourced Data Collection
As mentioned in section 3.1, the Turk task for this dataset is primarily a paraphrasing task in which Turkers work through a structured dialogue, rephrasing templated sub-questions at each step.
Each task takes the form of a dialogue involving three entities: the "user", which is an automated dialogue partner, an automated "director" that guides the dialogue and provides detailed instructions, and the "agent", which is the role performed by the Turker. Upon entering a task, the Turker is shown the "target question", or the original question from the CWQ dataset, and asked if the question was sensible to them. If so, they are asked to rephrase it using different language. If not, they proceed with the dialogue in the hopes that the decomposition process will make the meaning of the question clear. In these cases, the Turker is asked to rephrase the target question at the end of the dialogue. This process is included to encourage better understanding of the target question and to help us recognize confusing questions in the original dataset and replace them with higher-quality questions when appropriate.
Next, the target question is automatically decomposed into templated sub-questions which are displayed to the worker, who rephrases them into English. These rephrased questions are sent to the automated user, who provides corrections as necessary. The worker rephrases any new questions and the edits are automatically made. At the end of the dialogue, the worker is asked for any feedback regarding the dialogue. This feedback is later used to make corrections and flag any problems that might have arisen. Screenshots of the dialogue interface can be seen in Figure 4.

A.5 Cleaning the Dataset
As mentioned in section 4.2, we employ a semi-automatic data cleaning method to reduce the error rate in the INSPIRED dataset. Because data cleaning can be an expensive and time-consuming process, the goal is to develop a method that reduces the number of items in the dataset that need to be manually reviewed. We therefore use an automatic method to identify a small subset of the dataset that contains as many errors as possible, which is then manually reviewed. To this end, we utilize a pretrained sequence-to-sequence model that employs the idea of cycle consistency (Zhu et al., 2017) to identify poor paraphrases by retrieving meaning representations (MRs) from questions rephrased by the Turkers. These MRs are then compared against the original MRs and evaluated for similarity.
In order to evaluate the effectiveness of the strategy, a random 5% subset of the entire dataset was selected for annotation, using a binary classification of whether or not the rephrased question was an accurate paraphrase of the original templated question (and by extension, its original logical form). This annotation effort revealed that 4.4% of the rephrased questions contained errors, which we expect is representative of the entire dataset.
We then fine-tune Hugging Face's implementation of T5 as a seq2seq model to generate MRs, in this case templated sub-questions, to compare to the original MRs (Wolf et al., 2020; Raffel et al., 2020). These pairs of MRs then need to be sorted into a ranked list that places the paraphrases most likely to contain errors at the top. This allows us to use a precision-at-K measure: given a rank K, precision is calculated over the set of retrieved items with a rank of K or less. For the annotated test set, K equals 75, the number of observed errors. After ranking the list, we evaluate the quality of the method by counting how many errors appear among the top K data points, compared to a random baseline of 4.4% (the observed error rate), or about 3 errors.
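The precision-at-K measure used here is straightforward to compute once the items are ranked. The sketch below is a generic illustration with a hypothetical function name; it also reproduces the arithmetic behind the random baseline (4.4% of 75 items is about 3 errors).

```python
def precision_at_k(ranked_is_error, k):
    """Fraction of the top-k ranked items that are annotated as errors.

    ranked_is_error: list of booleans, ordered from most to least suspicious.
    k: number of top-ranked items to inspect.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_is_error[:k]) / k


# Expected number of errors among 75 randomly chosen items at a 4.4% error rate.
random_baseline_errors = 0.044 * 75
```

A ranking method is useful for cleaning only to the extent that its precision at K exceeds this random baseline.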
We employ two ranking methods to sort the pairs. First, we calculate the negative log-likelihood of the target MRs relative to the model and then do the same for the generated MRs.
$$S(y) = -\sum_{y_i \in y} \log p(y_i \mid y_{<i}, x; \theta) \qquad (1)$$

where $y = y_1, \ldots, y_{|y|}$ and $y_{<i} = y_1, \ldots, y_{i-1}$.

In Equation 1, $S(y)$ refers to the score of a given output sequence $y$, which is the sum of the negative log-likelihood of each $y_i$ given the sequence of $y$ tokens that came before. $\theta$ refers to the model parameters.
Once the negative log-likelihoods are determined for each candidate $y$, the best candidate is the one with the lowest score:

$$y^{*} = \operatorname*{argmin}_{y} S(y \mid x) \qquad (2)$$

Here, $y^{*}$ refers to the best generated output sequence, and $x$ is the given input sequence. A score is determined for the output sequence $y^{*}$, as well as for the target sequence $t$:

$$D = |S(y^{*}) - S(t)| \qquad (3)$$

While these two scores are comparable to each other, they are not comparable across other item pairs. In order to assign a ranking to every item in the dataset, we calculate the difference $D$ between the negative log-likelihoods of the target MR and the generated MR for each question (Equation 3) and sort the items from the largest difference to the smallest.
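The scoring and ranking in Equations 1 and 3 can be sketched as follows. This is an illustrative sketch only: it assumes the per-token probabilities $p(y_i \mid y_{<i}, x; \theta)$ have already been read off the model, and the function names and tuple layout are hypothetical.

```python
import math


def sequence_nll(token_probs):
    """S(y): sum of negative log-probabilities of the tokens of one sequence."""
    return -sum(math.log(p) for p in token_probs)


def nll_gap(generated_probs, target_probs):
    """D: absolute difference between the scores of generated and target MRs."""
    return abs(sequence_nll(generated_probs) - sequence_nll(target_probs))


def rank_by_nll_gap(items):
    """Sort items so that the largest gap D comes first.

    items: list of (item_id, generated_probs, target_probs) tuples
    (an assumed layout, chosen for illustration).
    """
    return sorted(items, key=lambda it: nll_gap(it[1], it[2]), reverse=True)


# Toy example: item "q2" has a much less likely target than its generated MR,
# so it is ranked ahead of "q1".
toy_items = [
    ("q1", [0.9, 0.8], [0.9, 0.7]),
    ("q2", [0.9, 0.9], [0.1, 0.2]),
]
ranked_ids = [item_id for item_id, _, _ in rank_by_nll_gap(toy_items)]
```

Because $D$ is computed within each pair before sorting, the ranking never compares raw scores from different inputs directly.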
Second, we calculate an edit distance score between the target MR and generated MR and sort based on the largest score. If the model has predicted an MR that is substantially far from the target MR in its phrasing, it likely has a different meaning.
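A minimal sketch of this second ranking is given below. The text does not specify which edit distance is used or at what granularity, so the sketch assumes a token-level Levenshtein distance; the function names and the pair layout are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_by_edit_distance(pairs):
    """Sort (item_id, target_mr, generated_mr) triples, largest distance first.

    MRs are whitespace-tokenized here, which is an assumption of this sketch.
    """
    return sorted(
        pairs,
        key=lambda p: levenshtein(p[1].split(), p[2].split()),
        reverse=True,
    )
```

Items whose generated MR differs from the target in many tokens rise to the top of the list and become candidates for manual review.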
Using the first ranking method, 17.3% of the errors were recovered, while the second recovered 32% of the errors. However, because the two ranking methods appeared to identify different errors with little overlap, both were used to identify the final set of questions for manual review, drawing from the two methods equally.
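One possible way to draw from two rankings equally is to alternate between them while skipping items already selected. The sketch below is an assumption about how such a merge could work, not a description of the procedure actually used; the function name and the `budget` parameter are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for padding when one ranking is shorter


def interleave_rankings(rank_a, rank_b, budget):
    """Alternate between two ranked lists of item ids until `budget` unique
    items are selected or both lists are exhausted."""
    if budget <= 0:
        return []
    selected, seen = [], set()
    for a, b in zip_longest(rank_a, rank_b, fillvalue=_MISSING):
        for item in (a, b):
            if item is _MISSING or item in seen:
                continue
            selected.append(item)
            seen.add(item)
            if len(selected) == budget:
                return selected
    return selected
```

Alternating keeps the two methods' contributions balanced, while the `seen` set prevents an item flagged by both methods from consuming two slots in the review set.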
The method was then applied to the entire INSPIRED dataset, using cross-validation with a series of 90% training, 10% testing splits to generate MRs for every rephrased question. Because the annotated subset had a 4.4% error rate, which we expect to be representative of all the data, the top-ranked 4.4% of the data was selected for manual review. This review resulted in 17.7% of the reviewed items being revised, meaning that the authors changed the rephrasing to more accurately reflect the original meaning.

Table 21 shows the N-gram statistics of all the templates in the dataset (template corpus) and all the rephrased questions (rephrased corpus). These metrics were calculated using the GEM evaluation scripts (Gehrmann et al., 2021). In this table, Vocab Size refers to the total number of distinct N-grams, while Distinct refers to the ratio of distinct N-grams to the total number of N-grams in the dataset. Unique specifies the number of N-grams that occur only once in the dataset, Entropy is the Shannon entropy over N-grams, and Cond(itional) Entropy is the entropy conditioned on (N−1)-grams.
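The N-gram statistics defined above can be approximated with the following sketch. It is not the GEM evaluation code: the function names are hypothetical, the logarithm base (2) is an assumption, and conditional entropy is computed as the entropy of the N-grams minus the entropy of their (N−1)-gram prefixes.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def entropy(counts):
    """Shannon entropy (base 2, assumed) of a Counter of event counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_stats(sentences, n):
    """Vocab size, distinct ratio, unique count, entropy, conditional entropy."""
    counts = Counter(g for s in sentences for g in ngrams(s, n))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no n-grams of this order in the corpus")
    # Marginalize n-gram counts over the last token to get prefix counts.
    prefix_counts = Counter()
    for gram, c in counts.items():
        prefix_counts[gram[:-1]] += c
    h = entropy(counts)
    return {
        "vocab_size": len(counts),
        "distinct": len(counts) / total,
        "unique": sum(1 for c in counts.values() if c == 1),
        "entropy": h,
        "cond_entropy": h - entropy(prefix_counts),
    }
```

Calling `ngram_stats` on the tokenized template corpus and on the tokenized rephrased corpus for each N would yield rows comparable in form to those of the table, although exact values depend on tokenization and on the details of the GEM scripts.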