Generating Data for Symbolic Language with Large Language Models

While large language models (LLMs) bring not only performance but also complexity, recent work has started to turn LLMs into data generators rather than task inferencers, where a more affordable task model is trained for efficient deployment and inference. However, such an approach has primarily been applied to natural language tasks, and has not yet been explored for symbolic language tasks with complex structured outputs (e.g., semantic parsing and code generation). In this paper, we propose SymGen, which utilizes LLMs to generate various kinds of annotation-expensive symbolic language data. SymGen consists of an informative prompt to steer generation and an agreement-based verifier to improve data correctness. We conduct extensive experiments on six symbolic language tasks across various settings. Compared with the LLMs, we demonstrate that a task model 1% their size can achieve comparable or better performance, largely cutting inference and deployment costs. We also show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model, saving a considerable amount of annotation effort. SymGen sheds new light on data generation for complex tasks, and we release the code at https://github.com/HKUNLP/SymGen.


Introduction
In the natural language processing (NLP) literature, the march of scaling language models has been an unending yet predictable trend, with new models constantly surpassing previous ones in not only performance but also complexity (Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2022). Such large language models (LLMs), however, incur a large computational cost in practice, especially when deployed in resource-restricted systems or used for inference in low-latency applications (Bommasani et al., 2021). Instead of treating LLMs as edge task inferencers, a recent line of work leverages LLMs as data generators, with the generated data being used to train more affordable task-specific models for efficient deployment and inference (Schick and Schütze, 2021; Meng et al., 2022; Ye et al., 2022b, inter alia). With only a few or even without demonstration examples, LLMs can generate high-quality data via in-context learning (Brown et al., 2020) or prompting (Radford et al., 2019). The task models trained on these generated data can achieve comparable or even better performance than the LLMs while enjoying a low inference cost at the same time.

Figure 1: Sample symbolic language datasets with complex structured outputs; the names of the symbolic languages are shown in square brackets. For example, "List the type of bed and name of all traditional rooms." maps to the SQL query SELECT roomName, bedType FROM Rooms WHERE decor = "traditional", while "How many large metallic items are there?" maps to the QDMR decomposition 1#) return items; 2#) return #1 that are large; 3#) return #2 that are metallic; 4#) return number of #3.
However, previous work mainly focuses on generating natural language data. To what extent this approach works for complex structured data, such as meaning representations and code (Figure 1), remains an open question. Investigating data generation via LLMs in the context of such symbolic language tasks is especially intriguing for two reasons: 1) the human annotation procedure for these tasks requires expensive domain-expert effort (Clarke et al., 2010) and carefully-designed strategies (Wang et al., 2015; Iyer et al., 2017; Herzig and Berant, 2019, inter alia); 2) conventional data augmentation methods aiming to enrich datasets for these tasks require hand-crafted rules and a considerable number of expert demonstration examples, and are mostly task-specific (Jia and Liang, 2016; Yu et al., 2018a; Andreas, 2020, inter alia).
To address these issues, we propose Symbolic data Generation (SYMGEN) for various annotation-expensive symbolic language tasks. SYMGEN works with an LLM trained on code (i.e., Codex-175B; Chen et al. 2021) and optional task-specific structure knowledge (e.g., the database for SQL; Iyer et al. 2017) through prompting or in-context learning. SYMGEN also comprises an agreement-based verification module, in which the outputs are verified by execution (e.g., programs, logical forms) or formatting (e.g., pseudo-logical forms), to ensure high-quality generations. With the generated data, we train efficient task models around 1% the size of Codex for task inference (e.g., T5 with 770M and 3B parameters; Raffel et al. 2020).

Figure 2: An example of an overall prompt, which consists of symbolic knowledge, a natural instruction, and demonstrations, together with a new example generated by Codex for the GeoQuery dataset.
• During data generation for symbolic language, symbolic knowledge and demonstrations have a greater impact than natural language instructions, and verification is essential to ensure data quality (§3.7, §4.2).

SYMGEN
To automatically curate large amounts of data for various annotation-expensive symbolic language tasks, we propose a unified pipeline named SYMGEN. SYMGEN comprises data generation by prompting LLMs and data verification by execution or formatting.

Prompt-based Generation
Human annotators need to carefully review annotation instructions before performing annotation, and the same is true for LLMs. We include natural language instructions, task-related symbolic knowledge (i.e., a database or ontology), and a few labeled examples in the prompt to steer the generation. An example prompt is shown in Figure 2, and we display the prompts for each task in Appendix G. Different from previous work on classification (Schick and Schütze, 2021; Meng et al., 2022; Ye et al., 2022b), where one of a limited set of label descriptions is used to guide the generation of x_i, the output structures for symbolic languages are not enumerable and require task-specific strategies to construct. Hence, we first generate the input x_i and then the output y_i conditioned on the generated x_i. LLMs may generate erroneous outputs that violate the grammatical constraints defined by different symbolic languages, so we over-generate multiple candidates for further verification. After prompt-based generation, we have a dataset D = {(x_i, {y_{i,j}})} for each task.
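As a concrete illustration of this two-stage procedure, the sketch below assumes a hypothetical `sample_llm` wrapper around the completion API (not part of the released code); the prompt layout loosely follows Figure 2, and the sampling hyperparameters mirror those reported in Appendix B.

```python
# Minimal sketch of SYMGEN's two-stage generation: inputs first, then
# over-generated candidate outputs conditioned on each input.

def sample_llm(prompt: str, n: int, temperature: float) -> list[str]:
    """Return n sampled completions for `prompt` (hypothetical stub)."""
    raise NotImplementedError

def input_prompt(knowledge: str, instruction: str, demos: list[tuple[str, str]]) -> str:
    # Symbolic knowledge (e.g., a database schema), a natural instruction,
    # and demonstration inputs steer the generation of new inputs x_i.
    return "\n".join([knowledge, instruction] + [x for x, _ in demos]) + "\n"

def output_prompt(knowledge: str, instruction: str,
                  demos: list[tuple[str, str]], x: str) -> str:
    # Demonstration (input, output) pairs steer y_i generation for a given x_i.
    pairs = [f"{dx}\n{dy}" for dx, dy in demos]
    return "\n".join([knowledge, instruction] + pairs + [x]) + "\n"

def generate_dataset(knowledge, instruction, demos,
                     n_inputs=200, n_candidates=30):
    dataset = []
    inputs = sample_llm(input_prompt(knowledge, instruction, demos),
                        n=n_inputs, temperature=0.8)
    for x in inputs:
        # Over-generate candidate outputs for later verification (§2.2).
        candidates = sample_llm(output_prompt(knowledge, instruction, demos, x),
                                n=n_candidates, temperature=0.8)
        dataset.append((x, candidates))
    return dataset
```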

Agreement-based Verification
In this work, we adopt an over-generation and verification approach to improve generation quality. Formally, given a set of n sampled output answers {y_{i,j}} for input x_i, we verify each answer y_{i,j} by calculating

$$w_{i,j} = \frac{1}{n-1} \sum_{k \neq j} \mathrm{sim}\big(\mathrm{exec}(y_{i,j}),\, \mathrm{exec}(y_{i,k})\big),$$

where exec(·) is a task-specific execution or formatting function (e.g., executing a Python program, or formatting QDMR into a graph representation), and sim(·, ·) ∈ [0, 1] is a similarity function that compares two results after running exec. A large value of w_{i,j} indicates that the j-th answer is highly similar to the others and is thus less prone to be mistakenly labeled. The value of the most confident answer, w_i = max_j w_{i,j}, is used to measure the quality of the input-output pair, and we only keep pairs whose w_i is larger than a certain threshold T, indicating the input-output pair is sufficiently confident.
In practice, when performing exec, we discard any y_{i,j} that fails to execute, which means it contains grammatical errors. When using Exact Match (EM) as the similarity function, the similarity score lies in {0, 1}, with 1 indicating that two execution results are exactly the same. If multiple answers share the same score, we choose the answer with the maximum log-likelihood during generation.
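A minimal sketch of this verification step is shown below, with generic `exec_fn`/`sim_fn` callables standing in for exec(·) and sim(·, ·); the threshold value is illustrative rather than the paper's.

```python
from math import inf

def agreement_scores(candidates, exec_fn, sim_fn):
    """Compute w_{i,j}: the mean similarity of candidate j's execution
    result to every other executable candidate's result."""
    executed = []
    for y in candidates:
        try:
            executed.append((y, exec_fn(y)))  # candidates failing exec are discarded
        except Exception:
            continue
    scores = []
    for j, (yj, rj) in enumerate(executed):
        sims = [sim_fn(rj, rk) for k, (_, rk) in enumerate(executed) if k != j]
        scores.append((yj, sum(sims) / len(sims) if sims else 0.0))
    return scores

def verify(x, candidates, logprob, exec_fn, sim_fn, threshold=0.5):
    """Keep (x, y) only if the most agreed-upon answer clears the threshold T;
    ties are broken by generation log-likelihood."""
    scores = agreement_scores(candidates, exec_fn, sim_fn)
    if not scores:
        return None
    w_i = max(w for _, w in scores)
    if w_i < threshold:
        return None  # input-output pair not confident enough
    tied = [y for y, w in scores if w == w_i]
    return x, max(tied, key=lambda y: logprob.get(y, -inf))
```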

Datasets and Evaluation Metrics
We consider five datasets that cover a range of programming languages and symbolic meaning representations: Spider (SQL; Yu et al. 2018b), NL2Bash (Bash; Lin et al. 2018), MBPP (Python; Austin et al. 2021), MTOP (TOP representation; Li et al. 2021), and Break (QDMR; Wolfson et al. 2020). We summarize the choice of the execution or formatting function exec, the similarity function sim, and the evaluation metrics for each dataset in Table 1. Details of the datasets and evaluation metrics are given in Appendix A.
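For concreteness, one possible exec/sim pair for SQL (a sketch assuming SQLite-compatible schemas, not necessarily the paper's exact implementation) executes candidates on an in-memory copy of the database and compares result sets with Exact Match:

```python
import sqlite3

def make_sql_exec(schema_sql: str, data_sql: str):
    """Build an exec function that runs a candidate query on a fresh
    in-memory SQLite database populated with the given schema and data."""
    def exec_fn(query: str):
        conn = sqlite3.connect(":memory:")
        try:
            conn.executescript(schema_sql)
            conn.executescript(data_sql)
            rows = conn.execute(query).fetchall()
            return sorted(map(repr, rows))  # order-insensitive result set
        finally:
            conn.close()
    return exec_fn

def em_sim(r1, r2) -> float:
    """Exact-Match similarity over execution results: 1 if identical, else 0."""
    return float(r1 == r2)
```

These plug directly into the verification sketch in §2.2.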

Comparison Methods
We generate data under various settings such as zero-shot and few-shot, and then train task models, e.g., T5-large and T5-3B, for inference. We compare the performance of the task models with both LLM inferencers and task models directly fine-tuned on human-annotated data rather than LLM-generated data:

• Codex (Chen et al., 2021). The tuning-free method that performs prompt-based in-context learning with Codex. Due to the context-length restriction, we perform prompt retrieval to include as many similar examples as possible in the full data setting.

• Codex + Verification. Similar to the above, but further including the answer verification module discussed in §2.2.

• T5-Large (Raffel et al., 2020). The tuning-based method that directly fine-tunes a T5-large model on few-shot or full human-annotated data instead of generated data.

• T5-3B (Raffel et al., 2020). The same method as above, but using a T5-3B model.

Implementation Details
We use the code-davinci-002 version of Codex, and T5-large and T5-3B as task models. Details are elaborated in Appendix B.

SYMGEN + T5 vs. Codex Inferencer
In this section, we consider generating data in few-shot and full data settings and report the model performance in Table 2. First, the performance of T5-Large consistently increases after adding data from SYMGEN on all tasks; notably, we achieve an average 40% performance boost in the few-shot setting. Second, although prompt-based inference has become the de-facto standard for using LLMs on downstream tasks, we find that using LLMs as data generators and training a much smaller task model can achieve comparable (e.g., Spider and NL2Bash) or better (e.g., GeoQuery, MTOP, and Break) performance. The reasons can be twofold: 1) as recent work shows that in-context learning approximates fine-tuning with a single gradient-descent step (von Oswald et al., 2022; Dai et al., 2022), the LLM inferencer fails to fully utilize the valuable human annotations, even with prompt retrieval; for example, Codex can surpass T5-3B on Spider in the few-shot setting but not in the full data setting; 2) the knowledge obtained from interacting with the verifier (i.e., the generated data) is never explicitly learned by the LLM, meaning it never learns to correct its own mistakes, and such knowledge has been shown to improve LLMs themselves (Haluptzok et al., 2022). In comparison, the task model can learn from those successful interactions. Finally, we find an exception on MBPP, where the Codex inferencer significantly outperforms T5, indicating that long-code generation is still challenging for small-sized models.

SYMGEN vs. Human Annotations
A key benefit of SYMGEN is reducing annotation effort when training a task-specific model. We show the performance of the trained T5-large model under various scales of human-annotated data and the few-shot data generated by SYMGEN in Figure 3. When using human annotations, model performance grows linearly as the data size increases exponentially, mirroring the power law observed in neural models (Kaplan et al., 2020). In contrast, the model trained on data generated from only 10 demonstrations, shown as the green horizontal line, significantly outperforms the model trained solely on these 10 given data points. Moreover, the intersection point of the horizontal and vertical lines indicates that training the model on data generated by SYMGEN is comparable to training on at least 100 (e.g., MTOP) and up to several thousand (e.g., Spider) human-labeled examples. This shows the potential of SYMGEN to greatly reduce the annotation effort on complex tasks.

SYMGEN for Zero-shot Learning
Given the striking ability of SYMGEN in few-shot data generation, we take a step forward and ask whether it can generate a high-quality dataset without any human annotations. We found it hard to control the output format for most symbolic languages without demonstrations, but we succeeded in generating SQL, as shown in Table 3. With appropriate prompts and verification, one can achieve a high zero-shot performance of 67.21, outperforming the supervised T5-Large model. We also note that the EM metric is much lower than that of the supervised T5 models, which raises the question of whether the result is due to Codex memorizing benchmark data during pre-training (Carlini et al., 2021; Rajkumar et al., 2022). Based on the much lower EM accuracy, we attribute the success of zero-shot learning to prompt engineering rather than memorization. Moreover, Zhong et al. (2020b) showed that adapting to a new environment significantly outperforms data augmentation in the training environment. Given no human-annotated data in the development environment, we further generate data for its 20 databases as surrogate knowledge for adaptation. The results significantly increase after training on these additional data, even outperforming the large Codex as well as the human-supervised T5-Large model, indicating that SYMGEN can be used for zero-shot adaptation for specific symbolic languages such as SQL.

Prompt Engineering in SYMGEN
Recent work highlights the sensitivity of PLMs to natural instructions (Zhao et al., 2021; Liu et al., 2022; Gao et al., 2021). In this section, we study the influence of symbolic knowledge (e.g., database and ontology), natural instructions, demonstrations, and language reformulation on answer generation. An example of these four types of information in prompts is shown in Figure 2. We report the results of removing certain types of information in Table 4. Removing symbolic knowledge or demonstrations has a greater impact on answer quality than removing natural instructions, suggesting that symbolic language prediction benefits more from the provided symbolic knowledge and exemplar pairs. An exception is Spider, where removing demonstrations only slightly hurts performance, mainly because Spider is a cross-domain dataset and the provided few-shot examples come from different domains (see the example in Figure 10).
As also discussed in §3.4, Codex is more familiar with SQL than with Prolog. We therefore further experiment on the GeoQuery-SQL dataset (Iyer et al., 2017), which converts Prolog commands to SQL commands; a comparison of the two prompts is shown in Appendix Figure 21. We found that altering Prolog to SQL in the prompts increases performance dramatically, indicating that aligning the expression of prompts with the pre-training corpus can be another effective form of prompt engineering.

How does SYMGEN compare with data augmentation methods?
For training a better task model on symbolic language tasks, data recombination (Jia and Liang, 2016) has been the common choice due to their compositional characteristics. We compare SYMGEN with two competitive baselines for semantic parsing: Jia and Liang (2016), which uses an SCFG induced by domain-specific heuristics, and Andreas (2020), which compositionally reuses previously observed sequence fragments in novel environments. We generate 1,000 instances for each method and report the results in Table 5. SYMGEN provides a larger boost, especially in the few-shot setting, where Andreas (2020) fails due to the lack of initial seed data.

How does the verification method affect performance?
We now investigate the effectiveness of the verification method discussed in §2.2. Figure 4 (a) compares various answer verification methods against picking the top-likelihood candidate without verification. Verifying based on agreement among self-generated candidates (sim(·, ·)) surpasses the no-verification baseline, and improves answer quality on all tasks more than simply checking grammatical correctness (exec(·)). Besides answer verification, we also show the effect of filtering low-confidence questions in Table 7, where the model trained on a much smaller amount of data outperforms the one trained on the original data. This further indicates that low-quality data can interfere with the training process.

How does a different number of human annotations affect SYMGEN?
So far we have compared the few-shot results of Codex with in-context learning and T5-large with SYMGEN using 10 human annotations. In this section, we experiment with various amounts of human annotations and report the results in Figure 4 (b). The performance gap between Codex and T5-Large remains virtually unchanged, indicating that the gain obtained from altering the pipeline (i.e., from in-context learning to data generation plus supervised tuning in SYMGEN) is maintained as the number of human annotations grows. This further shows that one can apply SYMGEN in real scenarios ranging from little to relatively abundant annotated data.

Data Analysis
We further conduct statistical and human evaluations of the quality of the generated data from the perspectives of question diversity, answer complexity, and data-pair quality, based on the data generated for MBPP in the full data setting and for Spider in the few-shot setting.
Question Diversity. We measure the question diversity of the generated data for Spider and MBPP by question length and question distribution. As shown in Figure 5 (a), the questions generated by SYMGEN are distributed similarly to the original dataset but with greater coverage. We also find that the average length of the generated questions is longer than in the original dataset for Spider but similar for MBPP, as shown in Appendix E.1.

Answer Complexity
We first measure the complexity of answers by their response lengths. For Spider, as shown in Figure 5 (b), the generated answers are on average longer than those in the original dataset. Moreover, we measure answers by their hardness, defined by the number of keywords, following Yu et al. (2018b). However, there are mainly three issues in the data generated by SYMGEN for both MBPP and Spider. First, SYMGEN may generate ambiguous and under-specified questions (examples in Appendix E.3). Second, the answers can sometimes be meaninglessly complex: in Spider, SYMGEN tends to generate SQL queries with multiple JOIN clauses, making the response sequences longer than in the original dataset; similarly, the generated Python code tends to use for-loops and recursion instead of Python's built-in functions (e.g., max, min). Third, it can be difficult to verify the correctness of the generated answers based on either the original databases in Spider or the test cases generated along with the Python solutions for MBPP: a quarter of the generated SQL queries have empty execution results on the original Spider databases, and more than 10% of the generated Python programs have wrong test cases. We hope these observations shed light on possible improvements for future work.
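The two automatic checks mentioned above can be sketched as follows; the helper names are hypothetical, and this is illustrative rather than the analysis code used in the paper.

```python
def has_empty_result(query: str, exec_fn) -> bool:
    """Flag generated SQL whose execution result is empty on the source DB."""
    try:
        return len(exec_fn(query)) == 0
    except Exception:
        return True  # ungrammatical queries are flagged as well

def passes_generated_tests(code: str, test_cases: list[str]) -> bool:
    """Run a generated Python solution against its generated assert-style
    test cases. Note: exec() runs arbitrary code; use real sandboxing in
    practice."""
    namespace: dict = {}
    try:
        exec(code, namespace)      # define the solution function(s)
        for test in test_cases:
            exec(test, namespace)  # e.g., "assert min_cost(...) == 8"
        return True
    except Exception:
        return False
```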

Prompting LLMs
In recent years, large pre-trained language models (LLMs) have shown promising performance on zero-shot and few-shot learning tasks via prompt-based in-context learning (Radford et al., 2019; Brown et al., 2020, inter alia). By explicitly curating pre-training corpora to include code (Wang, 2021; Chen et al., 2021; Chowdhery et al., 2022, inter alia), LLMs exhibit a surprising ability on symbolic tasks such as semantic parsing (Shin and Van Durme, 2022) and code generation (Austin et al., 2021; Poesia et al., 2021; Rajkumar et al., 2022). Nevertheless, prompt-based inference with LLMs suffers from several problems, including low inference efficiency and expensive deployment cost. In this work, we employ LLMs as data generators rather than direct inferencers, generating supervised data with minimal human effort to improve the performance of much smaller models for efficient inference on downstream tasks.

Data Generation
Data generation is an alternative to data augmentation that creates entirely new examples instead of recombining original ones (Jia and Liang, 2016; Andreas, 2020, inter alia) (see Appendix F for details). Conventional approaches adopt fine-tuned generative models (Zhong et al., 2020b; Guo et al., 2021; Wang et al., 2021a, inter alia) as input generators, with a semantic parser (e.g., a PCFG grammar) for sampling symbolic outputs. Considering the difficulty of designing grammars to sample useful symbolic forms in complex domains, Yang et al. (2022) assume access to an unlabeled corpus of symbolic language represented in canonical forms and simulate natural language inputs via LLMs. In comparison, we explore directly generating symbolic forms as well as natural language, without the need to design task-specific grammars for symbolic forms or synchronous context-free grammars (SCFGs) that map between canonical forms and symbolic forms. Data generation via LLMs has also been explored in various contexts, e.g., cross-lingual semantic parsing (Rosenbaum et al., 2022), Python programs (Haluptzok et al., 2022), instruction generation (Wang et al., 2022), and multimodal tasks (Liu et al., 2023; Pi et al., 2023); in contrast, we aim to unify the data generation procedure across various symbolic language tasks. Furthermore, for simple classification tasks, it has been found that a smaller model trained on data generated with a few or even zero human demonstrations can outperform the LLMs (Schick and Schütze, 2021; Meng et al., 2022; Ye et al., 2022b,a; Gao et al., 2023). This work fills the gap by extending such an approach to complex symbolic language tasks.

Conclusion
In this work, we treat LLMs as data generators rather than task inferencers for complex symbolic language tasks, with the generated data being used to train much more affordable models for deployment and inference. We demonstrate that a 1%-sized model trained under SYMGEN can achieve performance superior to the LLM inferencers. We especially show its effectiveness in low-resource scenarios, a common situation for symbolic language tasks given their annotation-expensive nature. Additionally, we reveal the possibility of obtaining a well-performing task model through SYMGEN even without any human annotations.

Limitations
This work is based on prompting and in-context learning with informative prompts for symbolic data generation. However, the information that can be packed into the prompt is hard-limited by the prompt length, as language models are built and trained to handle sequences of only a certain length. The problem becomes more acute for symbolic languages that have complex grammar and are rarely seen by the LLMs during pre-training. Possible solutions are internalizing the grammar knowledge into the output rather than the input through constrained decoding algorithms (Scholak et al., 2021; Wu et al., 2021; Shin et al., 2021; Shin and Van Durme, 2022), identifying a limited set of relevant documentation when generating data (Agarwal et al., 2020; Zhou et al., 2022), or improving the architectures of LLMs to handle long inputs (Katharopoulos et al., 2020; Peng et al., 2020; Press et al., 2021). In addition, alternative evaluation metrics such as tree edit distance or Smatch (Cai and Knight, 2013) can be employed to measure the similarity between two symbolic expressions when execution is impractical.

B Implementation Details
For prompting or in-context learning with Codex, we use code-davinci-002 with a maximum context size of 7,000 tokens. For all tasks, we set the temperature to 0.8 and the number of samplings to 30 for answer generation. When generating questions, we construct 200 initial prompts by randomly selecting in-context examples and use a mixture of temperatures (i.e., 0.6, 0.8, and 1.0) with 100 samplings each to generate at most 60k questions. For Spider, we generate 200 questions for each of the 140 databases in the training set, which results in at most 84k data pairs using the three temperatures. We set the number of shots to 10 in the few-shot setting. In the full-data setting, since prior work found that including similar exemplars helps answer prediction (Liu et al., 2022; Wu et al., 2022; Ye et al., 2023), we use all-mpnet-base-v2 (Song et al., 2020) to encode questions and Faiss to search for similar examples. We truncate the number of in-context examples based on the maximum context size and order the examples from least to most similar.
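A minimal sketch of this retrieval step using the sentence-transformers and Faiss libraries is shown below; the function names and pool/query interface are our own illustration, not the released code.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")

def build_index(pool_questions: list[str]) -> faiss.IndexFlatIP:
    # Normalized embeddings make inner product equivalent to cosine similarity.
    emb = encoder.encode(pool_questions, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(np.asarray(emb, dtype="float32"))
    return index

def retrieve_demos(index, pool: list[str], query: str, k: int = 10) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    demos = [pool[i] for i in ids[0]]
    return demos[::-1]  # order from least to most similar, as described above
```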
We mainly use T5-large (770M) and T5-3B (Raffel et al., 2020) as task models for all the datasets. For the MBPP (Python) dataset, we find that the original T5 tokenizer, which is based on SentencePiece, removes the indentation and whitespace in code during tokenization, and this breaks the execution of the generated Python programs. For this reason, we use CodeT5-large (770M; Wang et al. 2021b) on MBPP.
For training T5, we adopt the setting of Xie et al. (2022): a batch size of 32, the Adafactor optimizer (Shazeer and Stern, 2018) for T5-large, the AdamW optimizer (Loshchilov and Hutter, 2018) for T5-3B, a learning rate of 5e-5, linear learning rate decay, and a maximum of 50 training epochs with an early-stopping patience of 5. In the full-data setting, we first tune on the mixture of synthesized and human-annotated data, then continue tuning on the human-annotated data only. We find this two-stage training performs better than an importance-weighted loss (see Appendix C for details).
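The two-stage schedule amounts to the following sketch, where `finetune` is a hypothetical stand-in for a standard sequence-to-sequence training loop with the hyperparameters above.

```python
def finetune(model, data, lr=5e-5, batch_size=32, max_epochs=50, patience=5):
    """Fine-tune with linear LR decay and early stopping (stub)."""
    raise NotImplementedError

def two_stage_train(model, generated_data, human_data):
    model = finetune(model, generated_data + human_data)  # stage 1: mixture
    model = finetune(model, human_data)                   # stage 2: human only
    return model
```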

C Training Strategy
We compare training strategies when both full human-annotated data and generated data are available in Table 6. The two-stage procedure, which first trains on the mixture of both datasets and then solely on the human-annotated data, outperforms the weighted-training baselines.

D Question Verification Results
We measure the quality of a question through answer consistency: the more generated answers that are semantically equivalent, the less ambiguous the question and the higher its quality. We show the effect of the threshold used to filter ambiguous questions in Table 7.

E.3 Human Evaluation on Data Pair Quality
In this section we present some typical examples of the question ambiguity and under-specification problems in data generation by SYMGEN. The generated questions may be under-specified and ambiguous, sometimes even unreasonable, which in turn affects the generation of the corresponding answers. One such generated Spider example:

--Question: What are the names of the heads who manage the department with ID 15? SELECT

(The example prompts for NL2Bash, MBPP, and GeoQuery that originally followed here are shown in Figures 11-13 and 21.)

Figure 4: (a) Comparison of different verification methods; we show improvement over the baseline that directly takes the answer with maximum log-probability as output, without verification. (b) Results for Codex with in-context learning and T5-large with SYMGEN using different numbers of human annotations on the MTOP dataset.

Figure 5: (a) t-SNE visualization of data generated by SYMGEN (5,000 randomly sampled examples) and the original data in the MBPP dataset. (b) Comparison of the answer-length distributions of the original data and SYMGEN on Spider, with length on the x-axis and probability density on the y-axis. More visualizations are presented in Appendix E.1.

Figure 7: Comparison of the distributions of question embeddings (obtained with SBERT) in Spider (one randomly sampled database) and MBPP, between SYMGEN (5,000 random samples) and the original datasets.

Figure 10: Example prompt for generating SQL queries for Spider; only a single in-context example is shown for illustration.

Figure 11: Example prompt for generating questions for NL2Bash.

Figure 12: Example prompt for generating bash commands for NL2Bash.

Figure 13: Example prompt for generating question descriptions for MBPP.

Figure 18: Example prompt for generating question decompositions for Break.

Figure 21: Prompt comparison for GeoQuery and GeoQuery-SQL; only two demonstrations and part of the symbolic knowledge are shown for simplicity.

Table 2: Results of data generation for training a task model under the full data and few-shot settings. The top-scoring results for each setting are in bold. We show the average improvement with SYMGEN across all tasks in the last column.

Table 3: Results for zero-shot data generation on Spider.

Table 4: Results of few-shot answer generation with different prompts. GeoQuery-SQL refers to converting the language of the few-shot examples from the original Prolog commands in the GeoQuery dataset to SQL. We find that symbolic knowledge and language reformulation both play key roles in generation quality, and the effect of natural instructions varies across symbolic languages.

Table 5: Comparison of different data augmentation methods on the GeoQuery dataset. SYMGEN provides a larger boost to performance, especially in the few-shot setting.
Human Evaluation of Data-pairs. In order to evaluate the quality of the generated data, we present human evaluations of the data-pair quality of the generated Spider and MBPP datasets. We randomly sample 100 examples from SYMGEN for both datasets and manually review the sampled data. We find that 81 and 79 examples are correct for MBPP and Spider, respectively. Apart from that, we also find that SYMGEN generates more operators such as julianday and union in SQL compared to the original dataset, and the generated questions cover a wide range of data structures, including dict, list, and queue, for MBPP.
Timo Schick and Hinrich Schütze. 2021. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6943-6951, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. MPNet: Masked and permuted pre-training for language understanding. Advances in Neural Information Processing Systems, 33:16857-16867.

1 We determine model performance based on surface-form Exact Set Match (EM; Yu et al. 2018b), test-suite Execution accuracy (EX; Zhong et al. 2020a), which extends execution to multiple database instances per SQL schema to provide the best approximation of semantic accuracy, and Template Accuracy, where the query tokens are discarded (e.g., the template of [IN:A [SL:B text]] is [IN:A [SL:B]]).

Table 7: Results of filtering low-confidence questions. The model trained on a much smaller amount of data outperforms the one trained on the original data, indicating that low-quality data can interfere with the training process.
Figure 6: Comparison of the token-level length distributions of the questions on Spider and MBPP.
Write a question that can be answered based on the above tables.
--Question: List the type of bed and name of all traditional rooms.

Using valid SQLite, answer the following questions for the tables provided above.
--Question: List the type of bed and name of all traditional rooms.
SELECT roomName, bedType FROM Rooms WHERE decor = "traditional";

Yu et al. (2018a, 2020) follow the same spirit and use a hand-crafted SCFG grammar to generate new parallel data. However, rule-based heuristics or a large pool of seed examples are needed to induce the grammar.