UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models

Structured knowledge grounding (SKG) leverages structured knowledge to complete user requests, such as semantic parsing over databases and question answering over knowledge bases. Since the inputs and outputs of SKG tasks are heterogeneous, they have been studied separately by different communities, which limits systematic and compatible research on SKG. In this paper, we overcome this limitation by proposing the UnifiedSKG framework, which unifies 21 SKG tasks into a text-to-text format, aiming to promote systematic SKG research, instead of being exclusive to a single task, domain, or dataset. We use UnifiedSKG to benchmark T5 with different sizes and show that T5, with simple modifications when necessary, achieves state-of-the-art performance on almost all of the 21 tasks. We further demonstrate that multi-task prefix-tuning improves the performance on most tasks, largely improving the overall performance. UnifiedSKG also facilitates the investigation of zero-shot and few-shot learning, and we show that T0, GPT-3, and Codex struggle in zero-shot and few-shot learning for SKG. We also use UnifiedSKG to conduct a series of controlled experiments on structured knowledge encoding variants across SKG tasks. UnifiedSKG is easily extensible to more tasks, and it is open-sourced at https://github.com/hkunlp/unifiedskg.


Introduction
Structured knowledge (e.g., web tables, knowledge graphs, and databases) stores large amounts of data in organized structures, forming a basis for a wide range of applications, e.g., medical diagnosis, personal assistants, and customer relations management. Accessing and searching data in structured knowledge typically requires mastering query languages through professional training. To promote the efficiency of data access, structured knowledge grounding (SKG) systems ground user requests in structured knowledge and produce various outputs, including computer programs (e.g., SQL and SPARQL), table cell values, and natural language responses (Figure 1). For example, semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005) converts natural language questions into formal programs; knowledge-base question answering (Berant et al., 2013) derives answers from tables or knowledge graphs.
SKG has attracted significant interest and has been studied through different tasks defined by different communities. Recent developments in tasks, models, and datasets for SKG have led to task-specific modeling advances, making each task's progress seemingly unique and incompatible. A main reason is that SKG tasks are heterogeneous. Different types of structured knowledge, such as databases or knowledge graphs, lead to highly specialized encoders (Lin et al., 2019; Herzig et al., 2020; Wang et al., 2020; Yasunaga et al., 2021). Some SKG tasks, e.g., semantic parsing, use customized decoders to generate programs (Yin and Neubig, 2018; Ren et al., 2021). Therefore, instead of solving common challenges in SKG research, improvements in SKG have been prone to be exclusive to a single task, domain, or dataset.
In this paper, we propose the UNIFIEDSKG framework to advocate for a unifying view of 21 SKG tasks across six task families and multiple data domains (Table 1). UNIFIEDSKG standardizes datasets, models, code, experiments, and evaluation metrics into a single framework. By casting user requests, structured knowledge, and outputs into the text-to-text format (Raffel et al., 2020), it promotes model advances where new tasks can be framed with our standardized abstraction, and new models can be easily applied to diverse SKG tasks.

(Figure 1 shows example requests grounded in knowledge graphs, web tables/pages, and databases/apps: "Greece held its last Summer Olympics in which year?" for question answering, "Describe the table result." for data-to-text generation, "Canada obtained 3 more gold medals than Mexico." for fact verification, "I am looking for a cheap restaurant in the city center. Book a table for 8 at 18:30 on Thursday." for dialogue, and "Which players did win the Australian Open?" for semantic parsing.)
While previous works also cast SKG tasks into the text-to-text format (Hosseini-Asl et al., 2020; Shaw et al., 2021; Liu et al., 2021), their independent choices of pretrained language models (PLMs), input-output formats, and frameworks make our unification non-trivial. UNIFIEDSKG is easily extensible to more SKG tasks, and it is open-sourced to promote community-wide progress.
Using UNIFIEDSKG as a benchmark, we show that finetuning T5 (with constrained decoding or reranking when necessary) on individual tasks achieves state-of-the-art (sota) results on almost all of the 21 tasks, establishing a powerful and reproducible starting point for SKG research. T5 performance also increases with size on most tasks.
UNIFIEDSKG facilitates multi-task learning on SKG, enabling knowledge sharing and cross-task generalization. Although simple multi-task learning has mixed results, we show that multi-task learning with prefix-tuning (Li and Liang, 2021) benefits most tasks and largely improves the overall performance, on both T5-base and T5-large.
UNIFIEDSKG enables a series of controlled experiments on structured knowledge encoding. We find that T5 is sensitive to encoding variations, and the sensitivity varies across tasks. UNIFIEDSKG aims to facilitate more general and robust structured knowledge encoding methods. Finally, we conduct a comprehensive error analysis across SKG tasks. Although the errors made by PLMs decrease with the model size, T5-3B may still generate invalid outputs.
In summary, we 1) unify and benchmark 21 SKG tasks under the UNIFIEDSKG framework to evaluate diverse grounding goals and structured knowledge sources, 2) demonstrate (near) sota performance of T5 on all the unified SKG tasks, using a single, general-purpose approach, 3) show the benefit of knowledge sharing across SKG tasks via multi-task prefix-tuning, and 4) analyze recent modeling contributions (zero-shot, few-shot, and structured knowledge encoding) on these tasks. We hope UNIFIEDSKG enables the design of new models and learning algorithms that generalize to diverse SKG tasks and helps identify their challenges.

Related Work
SKG with PLMs. PLMs have been applied to several SKG tasks. To encode structured knowledge, prior work linearized the structured knowledge and concatenated it with the text (Hwang et al., 2019; Liu et al., 2020; Hosseini-Asl et al., 2020; Liu et al., 2021), which has been augmented by positional encoding (e.g., row/column embedding) (Herzig et al., 2020; Yin et al., 2020a), template-based linearization (Chen et al., 2020a,b; Oguz et al., 2021), and planning (Su et al., 2021). Recently, cell-column alignment has been modeled by manipulating attention (Eisenschlos et al., 2021). Hierarchical encoding is another way to represent the structure: Wang et al. (2021b) used tree-based transformers to represent the structure of tables; Iida et al. (2021) used transformers to encode row and column representations; Chen et al. (2021b) used hierarchical transformers to encode KG triples. SKG's outputs include, but are not limited to, structured meaning representations (e.g., logic forms, SQL), dialogue states, natural language, answer sets, and Boolean values. Among them, structured meaning representations are challenging for PLMs because PLMs are originally trained on natural language. To bridge this gap, Shin et al. (2021) adopted the insights from Berant and Liang (2014) and Marzoev et al. (2020) and proposed to convert formal language into an English-like representation, decode with GPT-3, and map back to formal language automatically. We do not focus on these techniques in this work; instead, we unify all tasks and systematically compare them.

Recent years witnessed the trend of unifying related but different tasks into a shared format. McCann et al. (2018) unified various tasks as question answering. Yin et al. (2020b) and Wang et al. (2021a) unified few-shot learning as textual entailment. PLUR (Chen et al., 2021c) unified program learning, understanding, and repair tasks into a graph-to-sequence format. In this paper, we focus on the text-to-text format (Raffel et al., 2020) due to its flexibility. Different from unifying tasks that only take text as input, a core challenge in unifying SKG tasks into the text-to-text format is to linearize structured knowledge. Notably, UnifiedQA (Khashabi et al., 2020) unified QA tasks, while UNIFIEDSKG covers a broader scope of six task families for systematic exploration.
The UNIFIEDSKG Framework

Task Unification
The guiding principle of UNIFIEDSKG's task selection is diversity. We unify 21 SKG tasks across six task families and multiple domains (Table 1).
Our task families include:
• Semantic parsing converts questions to logical forms (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005).
• Question answering derives answers to natural language questions based on structured data (Berant et al., 2013).
• Data-to-text generation describes structured data in natural language (Novikova et al., 2017).
• Fact verification checks if a statement is true based on the structured data (Chen et al., 2020b).
• Conversational tasks require understanding of not only the user's last request but also the full interaction history between users and machines (Budzianowski et al., 2018;Eric et al., 2019;Yu et al., 2019a).
• Formal language to text translation describes formal language in natural language (Chen et al., 2020d).
All these tasks take as input x a user request, a structured knowledge input, and an optional (dialogue) context to predict an output y. Figure 2 illustrates how we convert the input x to an input sequence x̃ and the output y to an output sequence ỹ by means of "linearization" (Liu et al., 2021), enabling the unification of diverse forms of structured knowledge. We provide more details, examples, and input length analysis in Appendices F and G. Our code implementation uses Hugging Face's Transformers (Wolf et al., 2020) and Datasets (Lhoest et al., 2021) toolkits.
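To make this concrete, below is a minimal Python sketch of the table linearization scheme described in Appendix F.2, with the user request placed before the knowledge (the rs order studied in Section 4.4). The function names and the plain-space field separator are our illustrative choices, not the exact UnifiedSKG implementation.

```python
from typing import List, Optional

def linearize_table(header: List[str], rows: List[List[str]]) -> str:
    """Flatten a table into the "col: ... row i: ..." scheme of Appendix F.2."""
    parts = ["col: " + ", ".join(header)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i}: " + ", ".join(row))
    return " ".join(parts)

def build_input(request: str, knowledge: str, context: Optional[str] = None) -> str:
    """Compose the input sequence x̃: request (and optional context) before knowledge."""
    fields = [request] + ([context] if context else []) + [knowledge]
    return " ".join(fields)

table = linearize_table(["year", "city"],
                        [["2004", "Athens"], ["2008", "Beijing"]])
print(build_input("Greece held its last Summer Olympics in which year?", table))
```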

Modeling
The simplest usage of UNIFIEDSKG is to train on individual tasks. In this case, we minimize the negative log-likelihood loss averaged over tokens in each batch. For decoding, we use beam search by default. UNIFIEDSKG also facilitates exploration of multi-task learning and few-shot and zero-shot learning with PLMs; details are presented in the corresponding parts of Section 4.
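As a concrete example, the following sketch shows a single training step with token-averaged negative log-likelihood and beam-search decoding using Hugging Face's Transformers. The model size, learning rate, beam width, and toy data are illustrative, not the paper's exact settings.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

src = ["which year? col: year, city row 1: 2004, Athens row 2: 2008, Beijing"]
tgt = ["2008"]

batch = tokenizer(src, return_tensors="pt", truncation=True, padding=True)
labels = tokenizer(tgt, return_tensors="pt", truncation=True, padding=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # exclude padding from the loss

# the returned loss is the negative log-likelihood averaged over target tokens
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# beam-search decoding, the default decoding strategy in UnifiedSKG
with torch.no_grad():
    out = model.generate(**batch, num_beams=4, max_length=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```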

Results on Individual Tasks
We apply T5 models (Raffel et al., 2020) to each individual task in UNIFIEDSKG. For model training, we set the maximum number of epochs to 50-200, depending on the dataset size. We use early stopping and model selection on the development set. More details are shown in Appendix D.1. For each task, we report one commonly used metric in Table 2. See Appendix B for all metrics.
Comparison with previous sota. Table 2 shows that vanilla T5-3B outperforms most previous sota models not trained on extra unsupervised in-domain data. Some semantic parsing sota models, denoted as + in Table 2, are also T5 with constrained decoding (Scholak et al., 2021) or reranking (Ye et al., 2021b). This shows that a generalist architecture like T5, when scaled up to a certain size, can be as good as task-specific architectures for SKG, suggesting the potential of larger PLMs.

Model scalability. In general, T5 performance increases with the model size, but this trend varies across task families. Semantic parsing, QA, and fact verification tasks benefit greatly from increased size, while text generation does not. See Section 4.5 for a human evaluation of text generation tasks. Also, the gap between T5-base (220M) and T5-large (770M) is larger than the gap between T5-large (770M) and T5-3B (3B).

Effect of pretraining on structured knowledge. Some smaller models pretrained on structured knowledge (Liu et al., 2021) show performance competitive with T5-3B, suggesting that pretraining on structured data is beneficial for SKG. This result calls for structured knowledge pretraining that generalizes to different SKG tasks across domains, which can be systematically explored using UNIFIEDSKG.

Table 3: Comparison between T5-3B and T0-3B. T0-3B is initialized from LM-adapted T5 and further pretrained on a large number of non-SKG tasks. We finetune both models on individual tasks. T0-3B underperforms T5-3B on semantic parsing (Spider) and outperforms T5-3B on dialogue state tracking (MWoZ) and fact verification (TabFact). We report results on the dev. set.
Effect of pretraining on non-SKG tasks. T0-3B (Sanh et al., 2021) is initialized from T5-3B and pretrained on multiple tasks that (in most cases) do not use structured knowledge as input (non-SKG tasks). Exploring the performance of T0-3B on SKG tasks helps us understand the relationship between SKG tasks and non-SKG tasks. Table 3 shows that T0-3B underperforms T5-3B on semantic parsing and outperforms T5-3B on dialogue state tracking and fact verification. We note that T0-3B is pretrained on dialogue QA, dialogue summarization, and NLI tasks; therefore, pretraining on non-SKG tasks might not be useful for SKG unless we add similar SKG tasks to pretraining.
2 For GrailQA and WebQSP, we run T5 and rerun the previous sota model (Ye et al., 2021b) using the gold entities. For MultiModalQA and FEVEROUS, we report performance of T5 and the previous sota models on the dev. samples with at least one table (samples with image input are further excluded for MultiModalQA); the gold table and text candidates are used for both T5 and the previous sota (for MultiModalQA, numbers are from Yoran et al. (2021), and for FEVEROUS, we rerun the available model (Aly et al., 2021) on gold candidates to obtain the number). We use sacreBLEU to report all BLEU results. ‡ We use gold entity linking, but the previous sota does not, which makes the results not directly comparable; therefore, we do not bold any numbers for CompWebQ and HybridQA. * T5-base with the independent output scheme (Lee et al., 2021) achieves 56.66 on MWoZ 2.1, higher than our sequence output scheme. For WebQSP, as the original dataset does not have a dev. set, we split the original train set into in-house train/dev. sets (90%/10%), following prior practice (e.g., Ren et al. (2021)). Similarly, for CompWebQ, as the test set is not publicly available, we split the original dev. set into in-house dev./test sets (20%/80%). For GrailQA, we split the original dev. set into in-house dev./test sets (5%/95%).

Multi-Task Learning
UNIFIEDSKG facilitates the exploration of multi-task learning. In this part, we systematically study multi-task learning on all 21 unified tasks. We find that SKG benefits from multi-task prefix-tuning on both T5-base and T5-large, showing that the benefits of multi-task learning scale with model size. The baselines we use include:

Single-task finetuning (ST-F), which is finetuning on individual tasks, same as Section 4.1.

Single-task prefix-tuning (ST-P; Li and Liang, 2021), which learns lightweight task-specific parameters while keeping the PLM fixed. We set the prefix length to 10. Clive et al. (2021) also used prefix-tuning on T5 for data-to-text generation.

Multi-task finetuning (MT-F), which combines the training data of all tasks with temperature mixing (Raffel et al., 2020; after hyperparameter tuning with a few steps, we set the temperature to 2). We select model weights based on the average metric on all tasks' development sets.
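For reference, temperature mixing in the style of Raffel et al. (2020) samples from task i with probability proportional to n_i^(1/T), where n_i is the task's (optionally capped) training-set size. A minimal sketch follows; the example counts are made up for illustration.

```python
def mixing_rates(sizes, temperature=2.0, cap=None):
    """Temperature-scaled mixing: p_i is proportional to min(n_i, cap) ** (1/T)."""
    scaled = [min(n, cap) if cap is not None else n for n in sizes]
    weights = [n ** (1.0 / temperature) for n in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# e.g., three tasks with 7000, 56000, and 1000 training examples at T = 2:
# larger tasks still dominate, but far less than under proportional mixing.
print(mixing_rates([7000, 56000, 1000], temperature=2.0))
```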
Table 4 shows that ST-P is comparable to ST-F on nearly all tasks. However, we find that it takes about 5-10 times as many training steps (see Appendix E), which is similarly observed for prompt-tuning (Lester et al., 2021). We also observe that MT-F leads to mixed results. For many tasks, MT-F is even worse than ST-F.

Multi-task prefix-tuning (MT-P). Our explanation for the mixed results of MT-F is that the inputs of SKG tasks contain different structured knowledge from diverse domains, making it difficult to learn shared parameters effectively. To address this challenge, we first pretrain a prefix on all tasks, freezing T5 and using the same temperature mixing as MT-F. In the second step, we initialize each task's prefix with this pretrained prefix and optimize the prefix while freezing T5. This initialization step is similar to the prompt transfer explored in Vu et al. (2021). Following ST-P, we set the prefix length to 10.
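A minimal sketch of this two-stage recipe is below. The Prefix module is a simplification of Li and Liang (2021) (their implementation reparameterizes the prefix with an MLP), and the dimensions and task names are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

class Prefix(nn.Module):
    """Learnable prefix key/value vectors for every layer; the PLM stays frozen."""
    def __init__(self, prefix_len=10, num_layers=24, hidden=1024):
        super().__init__()
        # one key and one value vector per prefix position and per layer
        self.embed = nn.Parameter(torch.randn(num_layers, 2, prefix_len, hidden) * 0.02)

    def forward(self):
        # prepended to each attention layer's keys/values at run time
        return self.embed

# Stage 1: train a single shared prefix on the temperature-mixed 21-task stream.
shared_prefix = Prefix()
# ... multi-task prefix-tuning loop with T5 frozen ...

# Stage 2: initialize each task's prefix from the shared one, then tune per task.
task_prefixes = {t: copy.deepcopy(shared_prefix) for t in ["spider", "wikitq", "totto"]}
```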
Table 4 shows that multi-task prefix-tuning outperforms single-task finetuning and single-task prefix-tuning on most tasks, and it largely outperforms the naive multi-task learning baseline. This demonstrates that SKG tasks can be studied together to share data and knowledge.

Exploring task knowledge transfer. UNIFIEDSKG facilitates studying knowledge transfer between SKG tasks. Given two tasks, task A and task B, we first train the model on task A and then continue training on task B. Table 5 shows that tasks benefit from other tasks with the same data source (e.g., tasks that all use Wikipedia tables as structured knowledge). We do not observe positive transfer between parallel tasks (e.g., semantic parsing tasks with different structured knowledge and different outputs) or between a task and its subtask (e.g., question answering can be viewed as executing semantic parses) when data sources are different. Compared to the positive results in Table 4, the results in this part indicate that manually selecting source and target tasks may not be efficient for multi-task learning.

Zero-shot and few-shot learning settings. We follow T0 (Sanh et al., 2021) to create natural language instructions for the unseen tasks; our instructions are provided in Appendix D.3. Brown et al. (2020) showed that large PLMs could be few-shot learners by encoding a few training samples as "context" to learn without gradient updates. We use GPT-3 (Brown et al., 2020) and Codex (Chen et al., 2021a) to explore such few-shot learning for SKG.
To stay within our budget, for GPT-3, we report the performance on 100 random dev. set samples. We explore two settings for few-shot learning.
In the first setting, we randomly sample few-shot examples from the training set; these examples are shared by all dev. set samples, denoted as random in Table 6. For sequences that are too long for the context windows of Codex (4096 tokens) and GPT-3 (2048 tokens), we use as many examples as possible and make sure that there is at least one example (truncated if needed).
In the second setting, we follow Gao et al. (2021) to select few-shot examples from the training set. We call this setting few-shot with example selection, denoted as select in Table 6. We use pretrained SBERT (Reimers and Gurevych, 2020) to embed the user request input (for tasks that only have structured input, we embed the linearized structured input) and sample the five most similar training examples measured by cosine similarity (a sketch of this selection step follows at the end of this subsection). Further details (e.g., prompts and task instructions) are provided in Appendix D.4.

SKG is challenging for zero/few-shot learning. Table 6 shows that zero-shot performance is very poor on most tasks (Spider and MultiWoZ are even 0). It also shows a large gap between few-shot learning and finetuning for Spider, WikiTQ, MWoZ, and TabFact, while the gap is smaller for generation tasks. For few-shot learning, example selection based on similarity outperforms random selection, but the gap is usually smaller than 10 points out of 100. It is also interesting to compare the results between synthesis tasks (Spider), which require predicting programs, and induction tasks (WikiTQ and TabFact), where a model directly outputs answers (Devlin et al., 2017). We find that PLMs generally struggle more when adapting to induction tasks (e.g., close to random guessing on the binary classification task TabFact), reminiscent of recent attempts at program synthesis and induction using PLMs (Austin et al., 2021). For GPT-3 and Codex, better zero-shot performance can be expected with better prompt design.
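The example-selection step can be sketched with the sentence-transformers library as follows; the checkpoint name and the toy requests are assumptions, and semantic_search ranks by cosine similarity by default.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint

train_requests = ["which country won the most golds?",
                  "how many cities hosted twice?",
                  "list the players from spain"]
query = "which nation has the most gold medals?"

train_emb = encoder.encode(train_requests, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)

# top-k nearest training examples by cosine similarity
hits = util.semantic_search(query_emb, train_emb, top_k=5)[0]
selected = [train_requests[h["corpus_id"]] for h in hits]
print(selected)  # these examples (with their outputs) form the few-shot prompt
```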

Structured Knowledge Encoding
Structured knowledge encoding has been widely explored (Bogin et al., 2019; Lin et al., 2019; Agarwal et al., 2020; Saxena et al., 2020; Yasunaga and Liang, 2020; Yasunaga et al., 2022; and others detailed in Section 2). We hope that UNIFIEDSKG can promote the systematic study of general structured knowledge encoding. To this end, this part focuses on the linearization of structured knowledge.

Does the order of user input, structured knowledge, and context matter? To explore the effect of the order of user input, structured knowledge, and context, we rerun the single-task experiments while switching the order of these components in both the training and development set. Table 7 shows that placing the text before the structured knowledge (rs) is better than the opposite (sr), which is consistent across SKG tasks. Our explanation is that the position of the text is relatively fixed in rs, whereas in sr it varies with the length of the (possibly truncated) structured knowledge.

How sensitive is the model to the order of structured knowledge? Order-insensitivity is common for most structured knowledge, e.g., a permutation of the columns in a table preserves its meaning. To study this insensitivity, we evaluate T5-large on a manipulated development set where the order of the schema (for databases), columns (for tables), or slots and values (for ontologies) is reversed. Table 8 shows that tasks with cross-domain tables and databases are less order-sensitive, while models are very sensitive to the order of the ontology. Other types of robustness (e.g., robustness to cell values irrelevant to the answer) remain an open question in UNIFIEDSKG.

Is it beneficial to represent structured knowledge as natural language? SKG data is not typically used to pretrain PLMs. Given ample training data, PLMs adapt well to SKG tasks, as shown in Table 2. However, under the low-resource setting, converting structured data to natural language might be helpful. For Spider, we use a shared template to convert structured data to natural language. For TabFact and WikiSQL, we randomly select 236 tables shared by both datasets and manually label templates to convert each row into a sentence (a sketch of this conversion follows below). Examples of the templates are shown in Appendix I. These templates produce about 1000 samples for each task, divided into training and test sets. We find that, on WikiSQL, the conversion to natural language stabilizes and accelerates the training process. Table 9 shows that conversion to natural language improves performance on WikiSQL, has no significant influence on TabFact, and slightly degrades performance on Spider.
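As an illustration of the row-to-sentence conversion, here is a minimal sketch with a hypothetical template; the 236 actual templates were manually written per table and are shown in Appendix I.

```python
# One manually written template per table; each table row fills its slots.
TEMPLATE = ("In {year}, {city} hosted the Summer Olympics "
            "and {nation} won {golds} gold medals.")

rows = [
    {"year": "2004", "city": "Athens", "nation": "Greece", "golds": "6"},
    {"year": "2008", "city": "Beijing", "nation": "China", "golds": "48"},
]

# the linearized table is replaced by these sentences in the model input
sentences = [TEMPLATE.format(**row) for row in rows]
print(" ".join(sentences))
```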

Human Evaluation for Generation Tasks
For each generation task, we randomly sample 100 development set samples and ask human annotators to judge the correctness of each output with a 0-1 score. Details are provided in Appendix D.5. Table 10 shows that automatic metrics do not always reflect human evaluation, calling for better automatic metrics that truly reflect models' abilities on generation tasks. Larger models are not always better, and a detailed error analysis is provided below.

Error Analysis
Error analysis based on output validity. Unconstrained decoding from PLMs may generate invalid outputs. For semantic parsing, we divide wrong outputs into invalid outputs (i.e., not executable when the output is SQL, and not parse-able when the output is an s-expression or TOP representation) and valid but wrong answers. Figure 3 shows that, for SQL semantic parsing, a large number of errors are caused by invalid outputs, and the number of invalid outputs gradually decreases as the model size increases. This phenomenon is also observed by Scholak et al. (2021), who used constrained decoding to improve validity, largely improving parsing performance. For s-expression semantic parsing, invalid outputs make up 30-50% of all wrong outputs, and increasing the model size does not reduce invalidity significantly. For fact verification tasks, the valid outputs are "entailed" and "refuted"; we observe that T5 always generates valid outputs. For question answering, we do not include the validity analysis since the validity check for an answer is non-trivial and could be imprecise.
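For SQL, the invalid vs. valid-but-wrong split can be approximated by execution, as in the sketch below. Comparing denotations on a single database is a simplification of test-suite evaluation (Zhong et al., 2020), and the database path is hypothetical.

```python
import sqlite3

def classify_prediction(pred_sql: str, gold_sql: str, db_path: str) -> str:
    """Label a predicted SQL query as invalid, valid_but_wrong, or correct."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return "invalid"  # not executable, counted as an invalid output
        gold_rows = conn.execute(gold_sql).fetchall()
        return "correct" if pred_rows == gold_rows else "valid_but_wrong"
    finally:
        conn.close()

print(classify_prediction("SELECT yeer FROM games",   # typo: invalid
                          "SELECT year FROM games",
                          "olympics.db"))             # hypothetical database
```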
Error analysis for text generation tasks. For generation tasks, we consider four types of errors: 1) missing information (required information is not shown in the output), 2) contradiction (the output contradicts the input), 3) hallucination (the output contains information that cannot be verified against the input), and 4) ungrammatical output. Figure 3 shows that the proportion of ungrammatical outputs is generally less than 5%. Missing information and contradiction are common errors made by T5, and performance gains generally come from reducing contradiction. Hallucination is not a common error for T5, except on the highlighted-table-to-text task (ToTTo), where T5 tends to output information from non-highlighted cell values.

Figure 3: Error analysis. For semantic parsing, we plot the number of invalid/valid-but-wrong predictions. For generation, we plot the proportion of missing-information/contradiction/hallucination/ungrammatical errors among all predictions (one prediction may have multiple errors). Full visualization is in Appendix B.
Case study. We summarize some interesting observations about the model outputs (more in Appendix H). Compared with T5-base and T5-large, T5-3B's outputs for text generation tasks tend to be more diverse and creative, as shown in Appendices H.2 and H.7. Also, T5-3B sometimes leverages domain knowledge to summarize facts in tasks such as DART (e.g., describing a rating of 5 out of 5 as low), while the other two models copy the original expressions in the input, as shown in Appendices H.5 and H.6. However, this ability puts T5-3B at risk of manipulating the information and meaning of the user request, as shown in Appendices H.3.2 and H.4.

Conclusions
In this paper, we propose the UNIFIEDSKG framework to promote systematic research on structured knowledge grounding by unifying 21 SKG tasks. Using UNIFIEDSKG as a benchmark, we demonstrate that finetuning T5 on individual tasks achieves state-of-the-art results on almost all 21 tasks. We show that multi-task prefix-tuning benefits most SKG tasks, largely improving the overall performance. For structured knowledge encoding, we find that the effectiveness of encoding variations varies across tasks. Moreover, UNIFIEDSKG is a challenging testbed for zero-shot and few-shot learning, as shown by the poor results of large PLMs.

Limitations
UNIFIEDSKG establishes a powerful and reproducible starting point for SKG research. New models can be easily applied to diverse SKG tasks, and new tasks can be easily framed with our standardized abstraction. UNIFIEDSKG promotes systematic study of more general and robust advances in structured knowledge encoding, multi-task learning, zero-shot learning, and few-shot learning for SKG tasks. It would also be interesting to explore general pretraining methods within UNIFIEDSKG, which could potentially benefit all the unified tasks. When the structured knowledge is too large for GPU memory, we truncate it based on heuristic rules, calling for future study on 1) incorporating a retrieval component into SKG and 2) designing sparse attention in T5 for structured knowledge, or other means to improve model efficiency.
UNIFIEDSKG currently provides the correct type of structured knowledge for each task. However, how a system searches for the correct structured knowledge resources, takes appropriate action, and integrates information and results from multiple structured sources given a user request is still underexplored, and this is a prerequisite for building a unified multi-purpose SKG system.
Since we select popular tasks from each task family, we risk disproportionality in terms of the data language, domain, and population, and we actively welcome diverse, multi-lingual tasks to be added to UNIFIEDSKG. Also, the error analysis of SKG can be more fine-grained, and we hope our findings promote future work on systematically studying and decomposing the behavior of PLMs on SKG tasks. Furthermore, training and evaluation data should reflect the intents and linguistic phenomena of the real world (de Vries et al., 2020), suggesting more realistic tasks to be added to UNIFIEDSKG.

A Contributions
Code implementation. Tianbao Xie and Chen Henry Wu implemented the code base of the UNIFIEDSKG framework and the experiment pipeline. The code of PICARD and advice from Torsten Scholak sped up the implementation.
Task unification. Tianbao Xie, Peng Shi, Michihiro Yasunaga, Chen Henry Wu, and Ming Zhong implemented the 21 tasks into the text-to-text format, adapted the metrics, and verified the performances.

Paper writing
Chen Henry Wu and Tianbao Xie wrote most of the paper. Michihiro Yasunaga, Peng Shi, and Chengzu Li added results and analysis for their corresponding parts. Peng Shi drafted the related work on SKG with PLMs. Torsten Scholak, Pengcheng Yin, Rui Zhang, Ruiqi Zhong, Victor Zhong, Michihiro Yasunaga, Connor Boyle, Chien-Sheng Wu, Sida Wang, Bailin Wang, Ansong Ni, Ziyu Yao, Lingpeng Kong, Caiming Xiong, Dragomir Radev, Noah A. Smith, and Luke Zettlemoyer carefully reviewed the paper and gave feedback over multiple rounds.

Experiments
Chen Henry Wu, Tianbao Xie, and Chien-Sheng Wu conducted experiments on individual tasks and multi-task learning. Tianbao Xie conducted the zero-shot learning experiments. Chengzu Li and Tianbao Xie conducted the few-shot learning experiments. Tianbao Xie conducted the experiments on the ordering of sequence inputs and order-sensitivity. Chengzu Li, Connor Boyle, and Peng Shi conducted the experiments on converting structured knowledge into natural language.

Human evaluation. Chen Henry Wu organized the human evaluation. Torsten Scholak, Rui Zhang, Chengzu Li, Connor Boyle, Tianbao Xie, Peng Shi, Tao Yu, and Chen Henry Wu were the human participants.

Error analysis and case study
Tianbao Xie, Chen Henry Wu, and Michihiro Yasunaga designed and conducted the error analysis for semantic parsing and generation tasks. Authors who participated in the human annotation selected the cases for the case study.

Discussion. We had three separate weekly meetings, and everyone in the project attended one of them. Torsten Scholak, Ruiqi Zhong, Pengcheng Yin, Victor Zhong, Peng Shi, Rui Zhang, Sida Wang, and Lingpeng Kong actively provided advice. Torsten Scholak provided signals that prefix-tuning would be comparable to fine-tuning. Ruiqi Zhong gave advice on analyzing the effect of model size, and Pengcheng Yin and Peng Shi gave advice on the analysis of converting structured knowledge into natural language. Pengcheng Yin helped interpret experimental results. Ziyu Yao suggested that we report both sota (w/ extra) and sota (w/o extra) for a fair comparison. Victor Zhong and Bailin Wang gave valuable suggestions on multi-task learning and task transfer analysis. Luke Zettlemoyer, Noah A. Smith, Caiming Xiong, and Dragomir Radev gave valuable comments on research questions and experimental design.

Computing resources. We thank Salesforce Research, an Amazon Research Award, ServiceNow Research, and Yale NLP for generously providing computing resources.

Tao Yu designed and led the research.

B Full Results

For the KVRET dataset, instead of the version used in our main tables, we re-run another more widely used pre-processed version (Madotto et al., 2018; Wu et al., 2019; Qin et al., 2020) on T5-base, T5-large, and T5-3B. Results are shown in Table 13.

C Input and Output Length Analysis
Linearization of a large structured knowledge input (e.g., large tables and KGs) can be arbitrarily long and needs to be truncated to fit on GPUs with limited memory. The input and output are tokenized by T5Tokenizer in Hugging Face's Transformers. We visualize the length distribution in Figure 5, and details are presented in Table 14. Among the datasets with very long inputs, we choose WikiTableQuestions to study the impact of input length. We visualize the table length distribution and the performance with different input truncation lengths in Figure 6. We observe that accuracy increases as the input becomes longer, motivating future work to study how to effectively encode large structured inputs, e.g., leveraging sparse attention (Zaheer et al., 2020).

For all tasks, we set the learning rate to 5e-5 with linear learning rate decay and use the development set for checkpoint selection. All experiments are run on NVIDIA Tesla V100 and NVIDIA Tesla A100 GPUs.
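The length measurement and truncation described above can be sketched as follows; the truncation budget shown is an assumed value, not a per-task setting (those are listed in Table 17).

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

inputs = ["col: year, city row 1: 2004, Athens row 2: 2008, Beijing"]

# record the tokenized length of each linearized input
lengths = [len(tokenizer(x).input_ids) for x in inputs]
print(lengths)

# truncate overly long linearized inputs to a fixed token budget
max_len = 1024  # assumed value for illustration
batch = tokenizer(inputs, truncation=True, max_length=max_len)
```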

D.2 Metric Details
For most semantic parsing tasks, we report the exact match accuracy of logical forms; for tasks that have a test suite (Zhong et al., 2020), we add the test suite metric to represent the model's performance. An exception is WebQSP, for which we follow previous work to execute the parses and report the F1 score.
For QA, we report the exact match accuracy of answer sets. For data-to-text generation, we report sacreBLEU (Post, 2018). We use each task's representative metric from previous work. For fact verification, we report accuracy. For high-fidelity NLG, we report BLEC (Shu et al., 2021), which is the exact match between keywords in the formal language and the natural language. Unless specified, we use T5-large and report the development set performance.
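For instance, the exact match over answer sets can be read as an order-insensitive comparison after splitting on the ", " delimiter from Appendix F.3; the sketch below reflects our reading, not the exact evaluation script.

```python
def answer_set_match(pred: str, gold: str) -> bool:
    """Exact match over answer sets: split on the ", " join delimiter
    (Appendix F.3) and compare the two sets order-insensitively."""
    return set(pred.split(", ")) == set(gold.split(", "))

print(answer_set_match("Beijing, Athens", "Athens, Beijing"))  # True
```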

D.3 T0 Zero-shot Experimental Details
For each task in UNIFIEDSKG, we search the prompts of Sanh et al. (2021) for similar tasks and adapt them into natural language instructions for the unseen tasks.

Figure 1: Structured knowledge grounding (SKG) leverages structured knowledge to complete user requests. By casting inputs and outputs into the text-to-text format, UNIFIEDSKG standardizes datasets, models, code, experiments, and metrics for 21 SKG tasks.

Figure 2: We unify SKG tasks with heterogeneous inputs and outputs into the text-to-text format.

Table 1: We unify 21 SKG tasks with different knowledge inputs, user inputs, and outputs, covering six task families.

Table 2: Test or development (dev.) set performance of models trained on individual tasks. Vanilla T5, or T5 with simple modifications (e.g., + constrained decoding or reranking), achieves sota on nearly all tasks. The best result without extra pretraining is shown in bold. More detailed results and result variances can be found in Tables 11 and 12 in the Appendix. Human evaluation for generation tasks is in Section 4.5. w/ (w/o) extra means with (without) extra pretraining on unsupervised structured data (e.g., web tables).

Table 4: Multi-task learning results. ST and MT stand for single-task and multi-task. F and P stand for finetuning and prefix-tuning. For total parameters, T and P denote the numbers of T5 and prefix parameters, respectively. Multi-task learning with a prefix improves the performance on most tasks, largely improving the overall performance. We report results on the dev. set.

Table 5: Task knowledge transfer. We use T5-large here. B only means training the model on task B; A to B means training the model on task A and then finetuning it on task B. In both settings, we report task B's development set performance. We find that tasks benefit from other tasks with the same data source.

Table 7: Ordering of inputs. Subscripts show the standard deviation over three runs. s, r, and c stand for the structured knowledge, request input, and context. Placing r before s is always better, and placing c between r and s is better for dialogue state tracking (MultiWoZ 2.1).

Table 8: Order-sensitivity of structured knowledge. Subscripts show the standard deviation over three runs. Same Order is the default benchmark setting. Reversed Order means reversing the structured knowledge ordering on the development set (but not the training set). Tasks with cross-domain tables (in WikiTQ), databases (in Spider), and triples (in DART) are less order-sensitive, while the pre-defined ontology (in MultiWoZ 2.1) is highly order-sensitive.

Table 10: Automatic metrics and human evaluation on the development set of generation tasks. * p < 0.05 for "the rank-1 model is better than the rank-2 model". † p < 0.05 for "the rank-2 model is better than the rank-3 model". Automatic metrics do not always reflect human evaluation. Larger models are not always better.
Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019. Global-to-local memory pointer networks for task-oriented dialogue. In Proceedings of the International Conference on Learning Representations (ICLR).

Table 11: Development set performance with full metrics. We run three experiments with different random seeds on a representative task from each family and report their averages and standard deviations, formatted as avr with the standard deviation as a subscript.

Table 12: Test set performance with full metrics (for tasks with a publicly available test set). We run three experiments with different random seeds on a representative task from each family and report their averages and standard deviations, formatted as avr with the standard deviation as a subscript.

Table 14: Input and output lengths for each task's train set.

Table 15: Input and output lengths for each task's development set.

Table 17: Hyperparameters for each SKG task.

F.2 Linearization
• Tables. Following Liu et al. (2021), we linearize a table into a sequence. By inserting several special tokens to indicate the table boundaries, a linearized table can be represented as "col: c_1, ..., c_N row 1: r_1 row 2: r_2 ... row M: r_M", where N and M are the numbers of columns and rows. For example, a navigation table from KVRET is linearized as "col : poi | poi_type | address | distance | traffic_info row 1 : sigona farmers market | grocery store | 638 amherst st | 3 miles | car collision nearby row 2 : cafe venetia | coffee or tea place | 269 alger dr | 1 miles | car collision nearby row 3 : 5672 barringer street | certain address | 5672 barringer street | 5 miles | no traffic row 4 : valero | gas station | 200 alester ave | 2 miles | road block nearby row 5 : stanford childrens health | hospital | 899 ames ct | 5 miles | moderate traffic row 6 : palo alto garage r | parking garage | 481 amaranta ave | 1 miles | moderate traffic row 7 : teavana | coffee or tea place | 145 amherst st | 1 miles | road block nearby row 8 : willows market | grocery store | 409 bollard st | 5 miles | no traffic", paired with a request input such as "ok, please give me directions via a route that avoids all heavy_traffic."

F.3 Output Format
When the output is natural language or formal language, we do not modify it because it is already in sequence format; for a set of answers, we use a comma followed by a space to join the answers; for a Boolean value, we map True to "entailed" and False to "refuted"; for a dialogue state, we follow Hosseini-Asl et al. (2020) and linearize it into a sequence of slot-value pairs (e.g., "hotel pricerange none, hotel type none, hotel parking yes, hotel book day friday, hotel book people 1, ...").