Exploring the Curious Case of Code Prompts

Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some (but not all) tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.


Introduction
Recent work has shown that pre-training language models (LMs) on a mixture of text and program code (e.g., Python or JavaScript) makes them more capable of reasoning over natural language (Suzgun et al., 2022). Such program-trained language models (PLMs) significantly outperform text-only LMs on tasks such as math problems and tracking shuffled objects, despite such tasks lacking any explicit code formulation (Liang et al., 2022).
Furthermore, prompting such PLMs with code-like structures (e.g., Python, JSON, PDDL) instead of text has been shown to lead to performance improvements on structured commonsense reasoning (Madaan et al., 2022), event argument extraction (Wang et al., 2022), knowledge graph construction (Bi et al., 2023), story understanding (Dong et al., 2022), and causal reasoning (Zhang et al., 2023). Such results naturally lead us to ask whether code prompting is the preferred way of interacting with PLMs in general. While previous work is limited to reasoning tasks, in this work we analyze a broad selection of tasks (e.g., QA, sentiment, summarization) and systematically compare the performance of prompting PLMs with code vs. prompting with text. We find that:
• With the exception of some reasoning tasks, code prompts do not outperform text prompts.
• The style of code prompt has a large effect on performance for some, but not all, tasks.
• Fine-tuning on text instructions leads to relative improvements when using code prompts.

Experimental Design
Model Selection For our text-based LM we use the original 175-billion-parameter davinci model introduced by Brown et al. (2020). For our PLM we use the newer code-davinci-002 model, which was explicitly trained on both text and code. Neither model underwent any supervised instruction fine-tuning. In addition, we analyze performance on text-davinci-002, a variant of code-davinci-002 trained explicitly on human demonstrations using supervised fine-tuning. We include this model to help us determine whether or not fine-tuning PLMs on text instructions affects their ability to interpret code prompts. All three models were queried through the OpenAI API, and our experiments cost approximately $2700 in total (see Appendix E for the full cost breakdown).
Task Selection Following the methodology of instructions and inputs are given as a task-specific class. Functionality is "implemented" as member functions. Figure 2 shows an example of the different styles of code prompts for the wikiHow temporal ordering task. Note that we attempt to write our code prompts such that we match the wording of the text-based PromptSource prompt as closely as possible.
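To make the prompt construction concrete, a code prompt can be instantiated by plain string formatting. The sketch below mirrors the VIC-style template from Figure 2; the filled-in values are illustrative assumptions, not our actual pipeline code.

```python
# Sketch of instantiating the VIC-style code prompt from Figure 2.
# The template text mirrors the figure; the example values are hypothetical.
VIC_TEMPLATE = (
    '"""Given a goal and two steps, predict the correct order '
    'to do the steps to achieve the goal"""\n'
    '# The goal that someone is trying to achieve\n'
    'goal = "{goal}"\n'
    '# One of the steps that needs to be taken\n'
    'step0 = "{step0}"\n'
    '# Another one of the steps that needs to be taken\n'
    'step1 = "{step1}"\n'
    '# The list of the correct order of those two steps\n'
    'order_of_exec = [{first}, {second}]'
)

prompt = VIC_TEMPLATE.format(
    goal="Draw a Simple Teddy Bear",
    step0="erase unnecessary lines",
    step1="draw a shirt for the bear",
    first="step0",
    second="step1",
)
```

At test time only the label slot is left for the model to complete; here the full prompt (including the label) is shown as it would appear for an in-context example.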
At inference time, for each test example, we randomly sample in-context examples from the training set and add them to the context window until the maximum context length is reached. This process circumvents the bias caused by static in-context examples. We conduct an ablation study in which we vary the random seed and show that this process produces consistent results (see Appendix C).
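The context-filling procedure described above can be sketched as follows; `count_tokens` stands in for the model's tokenizer, and the joining format is a simplifying assumption rather than our exact implementation.

```python
import random

def fill_context(train_examples, test_prompt, max_tokens, count_tokens, seed=0):
    # Shuffle the training set, then greedily prepend in-context examples
    # until the token budget for the context window is exhausted.
    # `count_tokens` stands in for a tokenizer-specific length function.
    rng = random.Random(seed)
    pool = list(train_examples)
    rng.shuffle(pool)

    budget = max_tokens - count_tokens(test_prompt)
    chosen = []
    for example in pool:
        cost = count_tokens(example)
        if cost > budget:
            break
        chosen.append(example)
        budget -= cost
    return "\n\n".join(chosen + [test_prompt])
```

Because each test example draws a fresh random sample, no single static set of demonstrations biases the evaluation.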

Results
What is the best type of code prompt? We compare performance across the four code prompt types from Section 2 on all 12 tasks using code-davinci-002 and report our results in Figure 3. We find that no single type of code prompt performs significantly better than the others across all tasks and that the relative difference in performance between code prompts also varies significantly across tasks. For example, on IMDb and SQuAD all code prompts have roughly even performance, while for tasks such as wikiHow-Temporal and WinoGrande we see a near 14% accuracy difference between the worst and best prompt.

Text Prompt:
You are trying to {goal}. You need to do two things: (a) {step0} (b) {step1} The first thing to do is

Code Prompt (Vanilla):
input0 = "Given a goal and two steps, predict the correct order to do the steps to achieve the goal"
input1 = "{goal}"
step0 = "{step0}"
step1 = "{step1}"
label = [{first}, {second}]

Code Prompt (VI - var identifier):
instructions = "Given a goal and two steps, predict the correct order to do the steps to achieve the goal"
goal = "{goal}"
step0 = "{step0}"
step1 = "{step1}"
order_of_exec = [{first}, {second}]

Code Prompt (VIC - var identifier + comments):
"""Given a goal and two steps, predict the correct order to do the steps to achieve the goal"""
# The goal that someone is trying to achieve
goal = "{goal}"
# One of the steps that needs to be taken
step0 = "{step0}"
# Another one of the steps that needs to be taken
step1 = "{step1}"
# The list of the correct order of those two steps
order_of_exec = [{first}, {second}]

Code Prompt (CVIC - class + var identifier + comments):
import order_steps

class Event:
    """Given a goal and two steps, predict the correct order to do the steps to achieve the goal"""
    def __init__(self, goal, step0, step1):
        self.goal = goal  # The goal someone is trying to accomplish
        self.step0 = step0  # One of the steps that needs to be taken
        self.step1 = step1  # Another step that needs to be taken

    def get_order_of_steps(self):
        # Output a list of the correct order of the two steps to be taken
        return order_steps(self.goal, self.step0, self.step1)

event = Event(goal="{goal}", step0="{step0}", step1="{step1}")
assert(event.get_order_of_steps() == [{first}, {second}])

Figure 2: An example of the four styles of manually written code prompts used in our analysis (Vanilla, VI, VIC, and CVIC) for the wikiHow temporal ordering task. At test time, variables in braces are replaced with information from the dataset item (as shown in Figure 1). For this task, {goal}, {step0}, and {step1} refer to the article title and the steps to order, while {first} and {second} refer to the true ordering of the steps.
In Appendix B, we calculate the average rank of each code prompt type relative to the others and find that the "Var Identifier + Comments" (VIC) prompt is the best across all tasks on average (2.25 avg. rank). We thus use this prompt type for our comparison in all future sections.

How many in-context examples should we include in our code prompt? We would like to also investigate how the number of in-context examples in the prompt affects models' ability to perform the task. We therefore conducted an experiment where we filled the context window of code-davinci-002 with in-context examples up to 2000 tokens, 4000 tokens, and 8000 tokens and plotted the validation accuracy of the model with respect to the number of examples in Figure 4.
Contrary to expectations, we find that the number of in-context examples has little effect on model performance for most tasks and actually has a negative effect on some tasks. This is especially interesting given that previous work on in-context learning with text prompts finds roughly monotonic improvement from adding more in-context examples (Liu et al., 2021). While further research is necessary, it seems that code prompts may have different scaling behavior than text prompts when used in in-context learning.
Which is better: code or text prompts? In our main experiment we compare the performance of the three GPT models with code prompts (VIC style) and text prompts across the 12 datasets. Given the results from Figure 4, we fill the context window of all models with in-context examples up to 4000 tokens to serve as a middle ground for comparing code and text prompts. We report the results of our main experiment in Table 2 and see several surprising trends. First, we find that prompting PLMs with code leads to substantial increases in performance for a few reasoning tasks, but that this trend does not hold across all tasks, or even across all reasoning tasks. For example, when using code prompts with code-davinci-002, we see a 10.5% accuracy increase on wikiHow temporal ordering but a 2.6% accuracy decrease on wikiHow goal-step inference, despite both being commonsense reasoning tasks with identical source material.
Second, we find that supervised instruction fine-tuning on natural language demonstrations does not hurt model performance on code prompts. Rather, we observe that code prompts outperform text prompts on more tasks when using text-davinci-002 than when using code-davinci-002, despite the fact that text-davinci-002 received no additional fine-tuning on code instructions.
Finally, we find that LMs not explicitly trained on code can also benefit from code prompting on certain reasoning tasks. In particular, code prompts outperform text prompts on davinci for 3 out of our 12 tasks, the same proportion as for code-davinci-002. The tasks that benefit from code prompts also seem to be largely consistent across the three types of models tested, suggesting some underlying trend as to which tasks systematically benefit from structured input.

Conclusion
In this work we investigate whether or not there exists a systematic performance difference between prompting PLMs with code or with text. We confirm that there are indeed tasks for which code prompting is significantly more effective than text prompting and that this finding holds across different types of models. However, for most tasks, we find that text prompting is still the best method for eliciting few-shot generalization from PLMs.
Given this result, it seems reasonable to attempt to predict which tasks will benefit from code prompts and which tasks will not. However, we show that making such predictions based on simple heuristics such as domain and task category is difficult and that the larger trends remain unclear. Future work should seek to investigate the core mechanism behind what makes code prompting effective for certain tasks.
Finally, concurrent to our work, a new line of research has emerged wherein models generate code and execute that code to produce valid output (Chen et al., 2022; Mishra et al., 2022; Gao et al., 2022; Lyu et al., 2023). Future work should consider whether or not the tasks that benefit from executable code prompts and those that benefit from non-executable code prompts have any overlap.

Limitations
One significant limitation of our study is that, as of March 23rd, 2023, OpenAI has deprecated access to code-davinci-002, thus rendering our results non-replicable for any team not granted special access to these models by OpenAI. We did not anticipate this deprecation while conducting this work, and we believe it raises serious questions about the usage of API-based language models in scholarly work.
Another limitation is that the 12 tasks we selected may not be representative of the broader population of natural language tasks. Had we conducted our experiments on a larger selection of tasks, there may have been larger-scale trends that we would have been able to uncover.
The largest and most pressing limitation of our work is that the models we are testing have closed-source pre-training datasets. Thus, we are unable to verify the extent to which our task datasets have been included in the training or instruction fine-tuning data. Given that the training data for most of the models tested in this work cuts off in late 2021, this is a very strong possibility. Our results should be viewed with this limitation strongly in mind.
Finally, while we experimented with different code prompts, the search space of possible prompts is very large. Thus, it is very likely that there exists some prompt that outperforms our chosen prompts for each task. Drawing conclusions based on a limited sampling of prompts is tenuous, and while methods exist for searching the space of all prompts, such techniques lack interpretability and erase any distinction between code and text prompts (Li and Liang, 2021).

A Detailed Task Description
Summarization is the task of composing a concise description of a lengthy text. Given a long narrative, the model is tasked with composing a short summary that contains the salient events of the original text.
For our study, we select the CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016) and XSUM (Narayan et al., 2018) datasets, as both are variants of the challenging abstractive summarization task. XSUM tasks models with generating extremely concise 1 to 2 sentence summaries of news articles, and CNN/Daily Mail tasks models with generating reasonably concise but longer abstractive summaries. For both the CNN/Daily Mail and XSUM datasets, we use ROUGE-2 score for evaluation.
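For reference, the core of ROUGE-2 is an F1 score over bigram overlap between the candidate and reference summaries. The following is a minimal sketch; real implementations also handle stemming, multiple references, and bootstrap resampling.

```python
from collections import Counter

def rouge2_f1(candidate, reference):
    # Collect bigram counts for the candidate and reference summaries.
    def bigrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + 2]) for i in range(len(toks) - 1))

    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```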
Question Answering (QA) is the task of composing answers given a question and an optional context passage. When this context passage is provided, the task is referred to as "open-book" QA, and when it is not, it is referred to as "closed-book" QA. Open-book QA tasks examine language models' ability to understand and extract information from their context, while closed-book QA tasks evaluate the amount of knowledge encapsulated in language models during pre-training.
For our study we pick two open-book QA datasets, SQuADv2 (Rajpurkar et al., 2018) and HotpotQA (Yang et al., 2018), which allow us to focus our evaluation on how structured prompts affect models' ability to comprehend long text input.
For both SQuADv2 and HotpotQA, we evaluate model performance based on the macro-averaged F1 score as proposed in Rajpurkar et al. (2016). This metric measures the average overlap between the prediction and ground truth answer. It is calculated by treating the prediction and ground truth as bags of tokens and first computing their F1. Then, the maximum F1 score is taken over all of the ground truth answers for a given question, and that score is averaged over all of the questions to get the final result.
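This metric follows directly from the description above. The sketch below omits the official script's answer normalization (lowercasing, article and punctuation stripping).

```python
from collections import Counter

def token_f1(prediction, gold):
    # Bag-of-tokens F1 between a prediction and a single gold answer.
    pred_toks, gold_toks = prediction.split(), gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def macro_f1(predictions, gold_answer_sets):
    # Take the max F1 over each question's gold answers, then average
    # over all questions.
    scores = [max(token_f1(pred, gold) for gold in golds)
              for pred, golds in zip(predictions, gold_answer_sets)]
    return sum(scores) / len(scores)
```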
Commonsense Reasoning is a machine reasoning task that demands the use of commonsense knowledge which is oftentimes implicitly present in the text (Sap et al., 2020). The customary formulations of commonsense reasoning tasks are Classification, where the input is a context, optionally with candidate answers as choices, and the output is a label from a pre-defined label space, and Question Answering (QA), where the input is a context followed by a reasoning question and the output is in free-form language.
For wikiHow Goal-Step, wikiHow Temporal, HellaSwag, WinoGrande, and ANLI, we use classification accuracy as the evaluation metric. To evaluate OpenPI, we use an F1 score based on the ROUGE metric, as described in the original paper (Tandon et al., 2020).
Sentiment Analysis is a task concerned with judging emotion and its degree in text. Given a passage, a language model is tasked with classifying the sentiment (positive, negative, neutral) and/or its degree (strongly, weakly, moderately).

Table 3: Relative performance rank of the four code prompt types from Section 2 across the 12 tasks. Ranks are calculated based on the results reported in Figure 3. We see that the "Variable Identifier + Comments" (VIC) style prompt performs the best out of all code prompt types on average.
The selected datasets, IMDb (Maas et al., 2011) and Yelp (Zhang et al., 2015), are both constructed from customer reviews. The IMDb dataset poses a binary classification problem where the input is a movie review and the label space is {negative, positive}. Yelp poses a five-way classification problem where the input is a restaurant review and the label space is the number of stars (out of 5) that customers assigned to the restaurant.
For IMDb, we use accuracy as the evaluation metric and for Yelp, we use Pearson Correlation between the predicted rating and the ground truth rating as the evaluation metric.
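The Yelp metric is a standard Pearson correlation between predicted and gold star ratings; a plain-Python sketch:

```python
import math

def pearson(xs, ys):
    # Pearson correlation between predicted and gold ratings.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```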

B Ranking of Code Prompt Styles
In Table 3 we report the rank-based statistics of the four code prompt types from Section 2 on our 12 tasks. Ranks are calculated based on the results reported in Figure 3 of the main paper. The numbers in a row reflect the relative standing of each code prompt on the corresponding task. While we note that all code prompts perform within ±0.5 ranks of each other on average, we see that on average the VIC prompt performs the best across all tasks and the Vanilla prompt performs the worst. Looking to the standard deviation section, we see that the VI prompt performs the most consistently across all tasks and that once again the Vanilla prompt performs the least consistently.
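The average-rank statistic above can be computed in a few lines; the scores below are hypothetical placeholders, not our actual results.

```python
def average_ranks(scores_by_task):
    # scores_by_task maps task name -> {prompt type: score}.
    # Within each task, rank prompt types by score (1 = best; ties share
    # the better rank), then average each prompt type's rank across tasks.
    prompt_types = list(next(iter(scores_by_task.values())))
    totals = {p: 0.0 for p in prompt_types}
    for scores in scores_by_task.values():
        ordered = sorted(scores.values(), reverse=True)
        for prompt_type, score in scores.items():
            totals[prompt_type] += ordered.index(score) + 1
    num_tasks = len(scores_by_task)
    return {p: total / num_tasks for p, total in totals.items()}

# Hypothetical accuracies for two prompt types on two tasks.
example_scores = {"taskA": {"Vanilla": 0.50, "VIC": 0.70},
                  "taskB": {"Vanilla": 0.60, "VIC": 0.80}}
avg = average_ranks(example_scores)
```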

C Ablation Study
To see whether the findings in our Results section could be attributed to variance in the random sampling of in-context training examples per test example, we conduct five repeated runs using code-davinci-002, with a different random seed each time, and calculate the standard deviation across the five runs. We report our results in Table 4 and find that the choice of in-context examples accounts for very little of the observed variance across prompt types and context lengths. This finding is surprising, as previous work has shown that the selection and ordering of in-context examples has a very large effect on model performance (Liu et al., 2021). However, it seems that our approach of randomly sampling in-context examples per test item helps to lessen this inherent variance.

D Evaluation on text-davinci-003
While conducting our research into the differences between code and text prompts, OpenAI released the text-davinci-003 model. This model differs from text-davinci-002 in that it is trained using Reinforcement Learning from Human Feedback (RLHF) instead of supervised instruction fine-tuning (Ouyang et al., 2022). Out of curiosity, to see the effect of this new training paradigm, we conducted experiments comparing this new text-davinci-003 model to the other GPT-3.5 models (text-davinci-002 and code-davinci-002). We report the results of our comparison across the 12 evaluation tasks in Table 5.
We see that while text-davinci-003 outperforms all previous models on wikiHow Temporal, WinoGrande, and OpenPI, it does significantly worse than previous models on wikiHow Goal-Step and HotpotQA. Such large reductions in performance are somewhat expected when using RLHF, given the costly nature of collecting human demonstrations. However, the magnitude of the decreases (-50.1% for wikiHow Goal-Step and -11.2% for HotpotQA) is nonetheless surprising, and such results raise important questions about exactly what is being learned when conducting instruction fine-tuning and whether or not this learned information can generalize to tasks not seen during fine-tuning.

E Evaluation Cost
In this section we report the approximate cost of conducting our experiments. In our study we use four OpenAI models, namely davinci, code-davinci-002, text-davinci-002, and text-davinci-003. While code-davinci-002 is free to use at the time of this study, we report the approximate cost of running the experiments on the other three models in Table 6. To estimate the cost of an experiment, we calculate the approximate number of tokens necessary for computing one dataset example and then multiply that by the number of examples in the dataset. For classification tasks, since we fill up the context window to roughly 4000 tokens for every test example, we estimate the number of tokens to be 4000 (3999 tokens for the prompt and 1 token for the label). To estimate the cost of generative tasks (OpenPI, HotpotQA, SQuAD, CNN/Daily Mail, and XSUM), we compute the average generation length from our generated samples and assume the in-context examples take up 3500 tokens. While this calculation results in a fairly loose upper bound, we believe it to be a good estimate of the total cost incurred by the project, as such overestimates help offset the cost of other miscellaneous API queries made over the course of the project.
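As a sanity check, the per-experiment estimate described above reduces to a simple multiplication; the per-1k-token price below is a placeholder, not an actual OpenAI rate.

```python
def estimate_cost(n_examples, prompt_tokens, completion_tokens, usd_per_1k_tokens):
    # Upper-bound estimate: (tokens per example) x (number of examples)
    # x (price per 1000 tokens).
    tokens_per_example = prompt_tokens + completion_tokens
    return n_examples * tokens_per_example * usd_per_1k_tokens / 1000

# A classification task: 3999-token prompt, 1-token label, 1000 examples.
# The price here is a hypothetical placeholder.
classification_cost = estimate_cost(1000, 3999, 1, usd_per_1k_tokens=0.02)
```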
Text prompt:
You are trying to draw a simple teddy bear. You need to do two things: (a) erase unnecessary lines (b) draw a shirt for the bear
The first thing to do is

Code prompt:
instructions = "Given a goal and two steps, predict the order to do the steps to achieve the goal"
goal = "Draw a Simple Teddy Bear"
step0 = "erase unnecessary lines"
step1 = "draw a shirt for the bear"
order_of_execution =

Figure 1: For certain tasks, prompting program-trained language models with code-like representations works better than prompting with text.

Figure 3: Comparison of code-davinci-002 across the four types of code prompts. Figures are split to allow for different y-axis scales. We see that different prompts do better on different tasks and while some tasks have high variance over prompt types, others do not.

Table 1: Macro F1 is based on Rajpurkar et al. (2016). For each task, we randomly sample a fixed set of 1000 examples from its validation or test set for evaluation. For OpenPI we are limited to 111 examples.
Figure 4: Performance score (y-axis) vs. number of in-context examples (x-axis, in log scale) using code prompts (VIC) with code-davinci-002. We see that increasing the number of examples does not always increase performance and in some cases makes it worse.

Table 2: Performance of the three LMs when using code prompts (+Code) vs. using text prompts (+Text). Blank cells indicate tasks for which single test examples could not fit in the context window. Color indicates whether or not code prompts are better, slightly better, slightly worse, or worse than text prompts. We see that while code prompts outperform text prompts for certain tasks (such as wikiHow Temporal and WinoGrande), text prompts are better on average. We also find that instruction fine-tuning (text-002) allows for better code prompt utilization.

Table 4: Comparison across 5 repeated runs of the code-davinci-002 model with text prompts using different random seeds for sampling in-context examples. We see minimal standard deviation (σ) between the runs.

Table 5: Comparison of the three GPT-3.5 models across our 12 datasets with text prompts. (+IFT) indicates the addition of supervised instruction fine-tuning and (+RLHF) indicates the addition of training using Reinforcement Learning from Human Feedback (Ouyang et al., 2022). We see that RLHF does not always improve performance and that for some tasks (HotpotQA and wikiHow Goal-Step) it causes large degradations in performance.

Table 6: The total estimated cost of running davinci, text-davinci-002, and text-davinci-003 for 1000 data samples from each dataset (except for OpenPI).