Batch Prompting: Efficient Inference with Large Language Model APIs

Performing inference on large volumes of samples with large language models (LLMs) can be computationally and financially costly in industry and real-world use. We propose batch prompting, a simple yet effective prompting approach that enables the LLM to run inference in batches, instead of one sample at a time. Our method reduces both token and time costs while retaining downstream performance. We theoretically demonstrate that under a few-shot in-context learning setting, the inference costs decrease almost inverse-linearly with the number of samples in each batch. We extensively validate the effectiveness of batch prompting on ten datasets across commonsense QA, arithmetic reasoning, and NLI/NLU: batch prompting significantly (up to 5× with six samples per batch) reduces the LLM (Codex) inference token and time costs while achieving better or comparable performance. For state-of-the-art Chat-based LLMs, e.g., GPT-3.5 and GPT-4, we show the benefits of batch prompting also hold. Further analysis shows that the number of samples in each batch and the complexity of tasks affect its performance. Moreover, batch prompting can be applied across different reasoning methods using LLMs. Our code is available at https://github.com/xlang-ai/batch-prompting.

We propose batch prompting, a simple yet effective approach for prompting LLMs, which allows the model to perform inference on multiple samples at once, instead of one sample at a time. This reduces token and time costs while retaining downstream performance, without any change to the APIs. As shown in Figure 1, standard prompting generates a response (answer) to one sample at a time, which takes N inference runs of an LLM for a test set of size N. Batch prompting, on the other hand, lets the LLM generate responses to b samples in a single inference run, taking only N/b runs for the same N samples.
We first demonstrate theoretically that under the few-shot in-context learning setting, most tokens consumed during the API call are the few-shot exemplars, and only a small portion of the token budget is spent on the particular inference sample(s) (Section 2). Therefore, increasing b in batch prompting reduces the token and time costs in an inverse-linear fashion. We extensively validate the effectiveness of batch prompting on diverse downstream datasets across commonsense QA, arithmetic reasoning, and NLI/NLU using Codex, a strong variant of GPT-3 finetuned on code data (Section 3). We also test batch prompting on the state-of-the-art GPT-3.5 and GPT-4 models. Batch prompting significantly decreases the token and run-time costs of using LLMs while achieving comparable or even better performance on all ten datasets.
In further analysis (Section 4), we find that the number of samples per batch and the complexity of tasks affect its performance. Moreover, we show that batch prompting works well across different reasoning methods (e.g., end-to-end, Chain-of-Thought, and code generation), suggesting that batch prompting is an efficient drop-in substitute for conventional prompting.

Approach
We first introduce batch prompting, an efficient alternative to standard prompting. We then compare the token and time costs of batch and standard prompting, demonstrating the efficiency of our method.

Problem Setup
The conventional paradigm (i.e., standard prompting in Figure 1) for prompting LLMs with in-context learning is as follows: K few-shot exemplars, each with both a context (e.g., question) and an output (e.g., answer), are selected to build the input prompt; one test sample with context only is appended at the end of the prompt; and the LLM generates the response for the test sample.
In this paper, we focus on a realistic scenario with N test samples in total, which is common when benchmarking on a dataset or handling a large volume of customer requests. In this case, standard prompting requires N separate LLM inference calls.

Batch Prompting
Batch prompting enables the LLM to generate responses for multiple samples in a single inference run, reducing the number of LLM inference runs from N to N/b, where b is the number of samples in one batch. Specifically, as shown in Figure 1, our prompt groups the K in-context exemplars into K/b batches of b exemplars each as demonstrations. In every batch, the demonstration contexts are arranged in a specific order at the beginning, with their corresponding outputs placed in the same order afterwards. Then, b test sample contexts are grouped together at the end of the input prompt. In this way, the LLM learns from the in-context demonstrations and generates corresponding responses for the entire batch of test samples. We add a position identifier "[index]" within each batch to 1) help the LLM identify the order correspondence between input contexts and generated responses and 2) ease the parsing of the generated responses.
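The batching and parsing scheme above can be sketched in code. This is a minimal illustration, not the paper's released implementation; the helper names (`build_batch_prompt`, `parse_batch_response`) and the toy exemplars are hypothetical.

```python
import re

def build_batch_prompt(exemplars, test_samples, b):
    """Group K (question, answer) exemplars into K/b demonstration batches,
    then append the b test contexts; each sample carries an [index] tag."""
    assert len(exemplars) % b == 0 and len(test_samples) == b
    parts = []
    for i in range(0, len(exemplars), b):
        batch = exemplars[i:i + b]
        # Demonstration contexts first, in order ...
        parts += [f"Q[{j + 1}]: {q}" for j, (q, _) in enumerate(batch)]
        # ... then their outputs, in the same order.
        parts += [f"A[{j + 1}]: {a}" for j, (_, a) in enumerate(batch)]
        parts.append("")
    # Finally the b test contexts, whose answers the LLM generates.
    parts += [f"Q[{j + 1}]: {q}" for j, q in enumerate(test_samples)]
    return "\n".join(parts)

def parse_batch_response(text, b):
    """Map a generated 'A[1]: ... A[2]: ...' block back to the b samples."""
    answers = {int(i): a.strip() for i, a in re.findall(r"A\[(\d+)\]:\s*(.*)", text)}
    return [answers.get(j + 1, "") for j in range(b)]
```

For example, with four exemplars and b = 2, the prompt contains two demonstration batches followed by the two test questions, and a response such as "A[1]: ... A[2]: ..." parses back into two answers via the [index] tags.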

Token Cost
The cost of one LLM call scales linearly with the number of tokens, including both the input prompt tokens (few-shot exemplars and instructions) and the generated tokens (according to, for example, OpenAI's pricing). In standard prompting, most tokens are consumed by the prompt, because the number of prompt tokens is usually far larger than the number of generated tokens so that the LLM can better learn from the in-context exemplars. Thus, the larger the portion of tokens spent on generation, the more economical the total cost.
We define the token efficiency η as the fraction of tokens in one LLM call spent on generated tokens. Let c_in and c_out denote the average numbers of context tokens and output tokens per sample. Omitting any instruction tokens for brevity, for standard prompting and batch prompting we have η_standard = c_out / ((K + 1)(c_in + c_out)) and η_batch = b·c_out / ((K + b)(c_in + c_out)). When K ≫ 1 and b < K, η_batch grows almost linearly with b, i.e., the token cost per sample decreases almost inverse-linearly with b; thus increasing b in batch prompting can greatly reduce token costs.
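To make the scaling concrete, here is a small back-of-the-envelope calculation; the per-sample token counts (60 context tokens, 40 output tokens) are hypothetical averages for illustration, not measurements from the paper.

```python
def tokens_per_sample(K, b, c_in, c_out):
    """Tokens of one call -- K exemplars plus b test samples, each with
    c_in context tokens and c_out output tokens -- divided by b samples."""
    prompt_tokens = K * (c_in + c_out) + b * c_in
    generated_tokens = b * c_out
    return (prompt_tokens + generated_tokens) / b

# With K = 12 exemplars, cost per sample falls almost inverse-linearly in b:
for b in (1, 2, 3, 4, 6):
    print(b, tokens_per_sample(K=12, b=b, c_in=60, c_out=40))
# b = 1 costs 1300 tokens per sample; b = 6 costs 300, a ~4.3x reduction.
```

The reduction saturates as b approaches K, since the b test contexts themselves start to dominate the prompt.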

Time Cost
Intuitively, batch prompting reduces the inference time by decreasing the number of API calls from N to N/b. Considering the Transformer (Vaswani et al., 2017) decoding time, however, the cost of a single call increases with b in batch prompting, because longer responses are generated than in standard prompting. We give a detailed derivation from the Transformer architecture perspective in Appendix A.
However, as most end-users are accustomed to and only have access to LLM API services, this part of the time cost is marginal (as observed in our main experiments) relative to the overhead of API calls and the per-minute request rate limits set by providers such as OpenAI. Besides, network connections may be unstable or slow, and users may seek to finish a task with as few LLM calls as possible.
Therefore, in practice, reducing the number of calls from N to N/b with batch prompting can substantially lower the time costs. Note that if the API call overhead and rate limits are no longer the major bottlenecks of time cost in the future, the increased decoding time for generating longer sequences discussed in Appendix A cannot be overlooked, and the time reduction of batch prompting will not be as pronounced.
Since LLM infrastructure and services can change over time, the token cost comparison is a more reliable and durable measure than the time cost.

Experiments
We extensively evaluate batch prompting across ten diverse datasets. Our results suggest that batch prompting can achieve up to a 5× improvement in token and time efficiency (with six samples per batch) while maintaining similar or even better downstream performance.

Experimental Setups
We evaluate OpenAI Codex (code-davinci-002) as the LLM in our main experiments across ten datasets. Codex was provided for free when this paper was written, but the token consumption reduction applies equally to other LLMs, so the token costs measured in our experiments are general.
We also test batch prompting on other state-of-the-art LLMs, including GPT-3 (text-davinci-003), GPT-3.5 (gpt-3.5-turbo), and GPT-4. For GPT-4, we test the first 100 samples of each dataset, considering the budget. The decoding temperature is set to 0. For each dataset, we manually select 12-shot samples from the training set as in-context exemplars, with Chain-of-Thought (Wei et al., 2022, CoT) reasoning steps in the answers (other reasoning methods beyond CoT are discussed in Section 4.4). We choose 12 exemplars because 12 is the least common multiple of 2, 3, 4, and 6, which makes it easy to analyze the effects of grouping them into batches of 2, 3, 4, or 6 samples in our ablation studies. More experimental details and full results are listed in Appendix B.

Main Results
Figure 2 compares the token and time costs of standard and batch prompting. As shown, batch prompting substantially (up to 5× with six samples in each batch) reduces both the token and time costs of standard prompting with Codex. Further, the decrease in costs scales almost inverse-linearly with the number of samples in each batch, verifying our analysis in Sections 2.3 and 2.4. Note that the time costs include the API call overhead and rate-limit blocks, which exist in the commonly used OpenAI and other LLM services. For LLM services where these are not time bottlenecks, the decoding time increase from a larger b should not be overlooked, as discussed in Section 2.4. As the LLM infrastructure can change at any time, the token efficiency improvement is easier to compare than time; the token reduction in Figure 2 should hold for any LLM over time.
Table 1 shows that batch prompting (with the best b, i.e., the number of samples in each batch) performs comparably to or even better than standard prompting over all ten datasets. We thus recommend that LLM users consider applying batch prompting to save money and time while maintaining good performance in realistic applications.
Table 2 shows the performance of these LLMs. All tested LLMs behave similarly to Codex: batch prompting retains downstream performance across datasets. In fact, Chat-based models tend to gain performance with batch prompting. We conjecture the reason is that GPT-3.5 and GPT-4 accept a dedicated system-message role as the instruction, which helps them better follow the batch prompting instructions for batched input and output. As discussed in Section 2, the token efficiency of batch prompting should hold across different LLMs, though the time reduction may vary depending on the LLM inference implementation.

Analysis
In this section, we assess the factors influencing batch prompting performance and the tradeoff between costs and performance. We also demonstrate that batch prompting can be applied to various LLM prompting methods, such as end-to-end prompting and code generation.

Selection of Batch Samples
Here we examine whether the selection of samples, i.e., how samples are grouped into batches, affects the performance of batch prompting. We study two widely adopted sample selection methods from in-context learning when grouping the test samples: grouping more similar (Rubin et al., 2021; Liu et al., 2022) and more diverse (Su et al., 2022; Agrawal et al., 2022) samples into batches. Specifically, given N test samples, to group similar ones, we use k-means clustering and post-process each cluster into equal size b by moving redundant samples to their closest groups with size < b. To group diverse ones, we apply the vote-k method (Su et al., 2022) to iteratively select diverse and representative groups of samples.
As listed in Table 3, neither similarity- nor diversity-based selection shows improvements over random grouping. We suspect the reason is that both methods assume in-batch samples can benefit from preceding similar or diverse samples, i.e., samples at the front of the batch. However, these earlier samples lack ground-truth outputs and may propagate errors to the rest of the in-batch samples. Developing effective strategies for selecting samples for batch prompting could be a promising direction for future research to further enhance its performance.
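The equal-size post-processing step for the similarity-based grouping can be sketched as follows. This is a hypothetical illustration of the "move redundant samples to their closest groups with size < b" heuristic, not the paper's code; it assumes cluster labels and centroids from a prior k-means run over exactly k·b samples.

```python
from math import dist

def rebalance_clusters(points, labels, centers, b):
    """Trim each over-full k-means cluster to size b by handing its
    centroid-farthest members to the nearest cluster with spare room."""
    k = len(centers)
    groups = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
    for c in range(k):
        # Keep the b members closest to this cluster's centroid.
        groups[c].sort(key=lambda i: dist(points[i], centers[c]))
        while len(groups[c]) > b:
            i = groups[c].pop()  # farthest surplus member
            order = sorted(range(k), key=lambda t: dist(points[i], centers[t]))
            dest = next(t for t in order if t != c and len(groups[t]) < b)
            groups[dest].append(i)
    return groups  # k batches, each holding exactly b sample indices
```

Each returned group then forms one inference batch of b test samples.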

Complexity of Tasks
In Table 1, the steepest drop (from 46.1 to 42.1) occurs on the AQuA dataset, an arithmetic reasoning task in a multi-choice QA format. One possible interpretation is that AQuA is more difficult than the other datasets, with the lowest absolute accuracy (46.1%), and thus LLMs are more easily disturbed when input contexts are grouped together.
We further study another task aspect that may affect performance: batch prompting tends to degrade performance more significantly with longer input contexts. We validate this assumption with WikiTQ (Pasupat and Liang, 2015), a challenging table QA dataset, whose tables contribute long inputs through their multiple rows and columns. We experiment with increasing table input lengths: a simplified table schema (i.e., column names without column types; avg. 24 tokens/table), a table schema (avg. 58 tokens/table), and a table schema with three table rows (avg. 216 tokens/table). As shown in Figure 4, in standard prompting (b = 1), inputting table schemas with three rows yields the best QA performance. However, it also sees the steepest performance drop as b increases under batch prompting. The shorter the input contexts, the steadier the performance with batch prompting. This suggests that long task inputs are more likely to cause confusion and performance drops when batch prompting is applied.

Reasoning Methods
In our main experiments (Section 3), we used Chain-of-Thought (CoT) prompting for all ten datasets. Here we examine whether batch prompting suits other common LLM reasoning methods. We experiment with two more reasoning methods: end-to-end (i.e., directly prompting the LLM to output the answers without intermediate steps) and program-based (i.e., prompting the LLM to generate programs that answer the question). For the program-based methods, we adopt Binder (Cheng et al., 2022) on WikiTQ and Program-of-Thought (Chen et al., 2022, PoT) on GSM8K and SVAMP.
As seen in Table 4, both end-to-end and program-based methods benefit from the efficiency of batch prompting while maintaining similar or even better task performance. This indicates that batch prompting is a drop-in replacement that can be combined with various reasoning methods in diverse scenarios.

Related Work
Improve In-Context Learning. The impressive capabilities of large language models (Brown et al., 2020; Chen et al., 2021; Chowdhery et al., 2022, LLM) have sparked a surge of recent research aiming to enhance in-context learning (ICL) performance. Several works propose different reasoning methods to prompt LLMs (Wei et al., 2022; Zhou et al., 2022; Khot et al., 2022), showing great improvements over directly prompting LLMs to output answers. Other works (Chen et al., 2022; Gao et al., 2022; Cheng et al., 2022) generate programs to solve reasoning tasks. Another line of work (Liu et al., 2022; Su et al., 2022; Agrawal et al., 2022) focuses on selecting better in-context exemplars. This work adds a new dimension to ICL for large-scale real-world applications: batch prompting to save budget and time while achieving comparable or even better performance.
Efficient Language Generation. Much recent work has proposed methods for efficient language generation, including machine translation (Kasai et al., 2020, 2021a,b), language modeling (Katharopoulos et al., 2020; Peng et al., 2021, 2022), and model cascading (Varshney and Baral, 2022). Many of these introduce alternative architectures to the standard Transformer to achieve such efficiency gains, which makes them hard to apply or deploy in real-world scenarios. Our method is a simple yet effective alternative to recent prompting methods, and thus it is applicable to any off-the-shelf language model API, such as those from OpenAI, Google, Anthropic, or any other available private LLM API, without any additional training or customized model hosting.

Limitation
Batch prompting has proven to be an efficient method for time and token reduction. Nonetheless, several critical considerations should be kept in mind when applying it across scenarios. First, to optimize its benefits, the length of the input prompt tokens should be (significantly) greater than that of the output tokens; thus, it might not be suitable for "heavy output" tasks like story generation. Note that while our experiments are conducted with few-shot in-context learning, the method is also applicable to the instruction-following paradigm, either on its own or in combination, by simply substituting or augmenting the few-shot inputs with instructions; the only crucial factor is the length of the input tokens shared across inference samples. Second, performance declines are possible. Our experiments indicate that task complexity and lengthy input contexts can negatively impact performance. Although we have not identified a definitive guideline for predicting performance, we advise users to first test on a small subset to gauge the effectiveness of batch prompting before applying it at scale.

Conclusion
We present batch prompting, a new way to prompt LLMs that performs inference on samples in a batched fashion. With batch prompting, multiple samples can be handled in one API call, so that the token and time costs can be significantly reduced. Extensive experiments on ten datasets across commonsense QA, arithmetic reasoning, and NLI/NLU show that batch prompting achieves better or similar performance compared to standard prompting, with much lower token and time costs. We hope batch prompting offers pragmatic value for efficient real-world LLM usage.

A Time Cost Analysis Regarding the Transformer Architecture

In batch prompting, assume there are K in-context exemplars (C tokens per sample on average) and b samples per batch at inference time. Standard prompting is the special case where b = 1. Since most current LLMs (e.g., Codex, PaLM) are based on the Transformer decoder-only architecture, we focus on the time cost of the auto-regressive decoder.
The plain Transformer time complexity for decoding one token is O(n²d), i.e., the time for encoding the embeddings of the input tokens, where n is the length of the input tokens and d is the dimension of the embeddings. With caching of previous tokens, the time complexity to decode each of the remaining tokens is O(nd). We omit d since it is a constant. With input length n ≈ (K + b)·C and C·b tokens to decode, the time of one inference is T_encode + T_decode, where T_encode = O(((K + b)C)²) is the time for encoding the input tokens in the decoder, and T_decode = Σ_{i=1}^{C·b} O((K + b)C + i) = O(C²·b(K + b) + C²·b²/2) is the time for decoding the remaining tokens. Since C can be treated as a constant, one inference time T as a function of K and b is T(K, b) = O((K + b)² + b(K + b) + b²). (3) Though these numbers are not exact given the constant coefficients hidden in Big-O time complexity, they show that the decoding time increase cannot be overlooked as b becomes large. We do not emphasize this in Section 2.4 because the overhead and rate-limit blocking time of the OpenAI API make up most of the time cost, and thus reducing the N API calls to N/b calls reduces the time cost almost inverse-linearly (see Figure 2).
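The growth can be checked numerically. The sketch below instantiates the derivation with all Big-O constants set to 1; it is a simplification for illustration, not the paper's exact Table 5 computation.

```python
def inference_time(K, b, C=100):
    """Relative time of one call: encode the (K + b) * C prompt tokens
    (quadratic in length), then decode C * b output tokens, each
    attending over the growing context; constant factors dropped."""
    n = (K + b) * C                      # input prompt length
    m = C * b                            # tokens to generate
    t_encode = n ** 2
    t_decode = sum(n + i for i in range(1, m + 1))
    return t_encode + t_decode

# K = 12, C = 100: one call gets slower as b grows, but the time per
# sample still falls because the K exemplars dominate the prompt length.
for b in (1, 2, 3, 4, 6):
    print(b, inference_time(12, b), inference_time(12, b) / b)
```

This mirrors the qualitative picture above: per-call time grows with b, while per-sample time shrinks sublinearly.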
However, if the overhead and rate limits are no longer the bottlenecks, the decoding time increase becomes non-negligible; e.g., rate limits are strict for Codex (code-davinci-002), GPT-3.5 (gpt-3.5-turbo), and GPT-4, but are not a big issue for GPT-3 (text-davinci-003).

B More Experimental Results
We list the results of all experiments (Tables 6-9). For the WikiTQ experiment with Binder, the LLM generation temperature is 0.4, following its paper. For the other experiments, the temperature is 0. For all experiments, top_p = 1, sampling_n = 1, logprobs = 1, and stop_tokens = \n\n. Five OpenAI keys are used as a polling pool, rotated to request the OpenAI API for Codex (rate limit errors still occur in the experiments and are counted into the time cost, since they are a practical issue). With fewer OpenAI keys, more rate limit errors would occur because the request interval for each key would be shorter.
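The key-polling setup can be sketched as follows; `RateLimitError` and `request_fn` are placeholders for the provider's actual error type and API call, not OpenAI's client code.

```python
import itertools
import time

class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit error."""

def call_with_key_pool(request_fn, keys, max_tries=5, backoff=3.0):
    """Rotate requests across a pool of API keys so each key sees a longer
    interval between calls; on a rate-limit error, switch to the next key
    and back off (this blocked time is what counts toward the time cost)."""
    pool = itertools.cycle(keys)
    for attempt in range(max_tries):
        key = next(pool)
        try:
            return request_fn(key)
        except RateLimitError:
            time.sleep(backoff * attempt)  # wait longer after repeated failures
    raise RuntimeError("rate limited on every try")
```

With fewer keys in the pool, the same key recurs sooner, so rate-limit errors and backoff sleeps become more frequent, as noted above.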
We follow Binder (Cheng et al., 2022) and Program-of-Thought (Chen et al., 2022) to build the prompts for WikiTQ, GSM8K (program), and SVAMP (program). For RTE, MNLI, and SST-5, we design the prompts ourselves using Chain-of-Thought. For prompts with fewer than 12 in-context exemplars, we manually pad to 12 using samples from the training set. We show batch prompting prompts with b = 4 as examples; for different b, we group the same 12 samples according to b. When using ChatGPT in Section 3.4, the prompt format differs from Codex and GPT-3 because of its conversational capability. See Table 17.
Q[1]: A garden produced 237 potatoes, 60 fewer cucumbers, and twice as many peppers as cucumbers. How many vegetables did the garden produce?
Q[2]: John's cow weighs 400 pounds. It increased its weight to 1.5 times its starting weight. He is able to sell the cow for $3 per pound. How much more is it worth after gaining the weight?
Q[3]: John writes 20 pages a day. How long will it take him to write 3 books that are 400 pages each?
Q[4]: James has a rainwater collection barrel. For each inch of rain he collects 15 gallons. On Monday it rained 4 inches and on Tuesday it rained 3 inches. He can sell water for $1.2 per gallon. How much money did he make from selling all the water?

SST-5 Prompt
Q[1]: a stirring, funny and finally transporting re-imagining of beauty and the beast and 1930s horror films.
Q[2]: they presume their audience won't sit still for a sociology lesson, however entertainingly presented, so they trot out the conventional science-fiction elements of bug-eyed monsters and futuristic women in skimpy clothes.
Q[3]: um, no..
Q[4]: jonathan parker's bartleby should have been the be-all-end-all of the modern-office anomie films.
A[1]: The tone is very positive.
A[2]: The tone is negative.
A[3]: The tone is neutral.
A[4]: The tone is positive.
Q[1]: lacks the inspiration of the original and has a bloated plot that stretches the running time about 10 minutes past a child's interest and an adult's patience.
Q[2]: the santa clause 2 proves itself a more streamlined and thought out encounter than the original could ever have hoped to be.
Q[3]: you might say tykwer has done all that heaven allows, if you wanted to make as anti-kieslowski a pun as possible.
Q[4]: otto-sallies has a real filmmaker's eye.
A[1]: The tone is very negative.
A[2]: The tone is positive.
A[3]: The tone is neutral.
A[4]: The tone is positive.
Q[1]: with a confrontational stance, todd solondz takes aim on political correctness and suburban families.
Q[2]: overall, cletis tout is a winning comedy that excites the imagination and tickles the funny bone.
Q[3]: with its parade of almost perpetually wasted characters ... margarita feels like a hazy high that takes too long to shake.
Q[4]: an ugly-duckling tale so hideously and clumsily told it feels accidental.
A[1]: The tone is neutral.
A[2]: The tone is very positive.
A[3]: The tone is negative.
A[4]: The tone is very negative.

Figure 1: Illustration of batch prompting compared with standard prompting. Batch prompting groups multiple samples into one batch (b = 2 in the figure) and lets the LLM generate multiple responses (highlighted in yellow) for the batch in one inference run.

Figure 2: Token and time costs per sample on three datasets for illustration (other datasets show similar trends). Batch prompting significantly lowers both token and time costs as the number of samples in each batch increases.

Figure 3: Accuracy over varying numbers of batch samples b on five datasets using batch prompting. Performance decreases with larger b.

Figure 3 illustrates the impact of the number of samples per batch, b, on batch prompting performance. Performance typically decreases as b increases, with a significant drop at b = 6 on four of the five datasets. However, the optimal performance is not always at b = 2: selecting b = 3 or b = 4 often yields good performance while conserving more tokens and time. The marginal time and token savings diminish as b grows, suggesting b < 6 (given the 12 in-context exemplars in our experiments) as a good balance between costs and performance.

Figure 4: Accuracy on WikiTQ with various table input strategies and b (the number of samples in each batch), studying how input length affects batch prompting performance. b = 1 means standard prompting. Average input tokens per table are 24, 58, and 216. As the number of batch samples increases, batch prompting suffers in downstream performance.
Thus, increasing b in batch prompting also increases the time cost of one inference (Equation 3). The influence of b grows with its value and is relatively marginal when b is small, especially when b ≪ K, which is the common setting (b = 1) in few-shot in-context learning. Table 5 shows a few examples with K = 12 (as in our experiments) and C = 100 for varying b, computed according to Equation 3.

Table 5: Time (no unit) per inference with K = 12, C = 100, and various b.

Table 2: Accuracy of different LLMs with standard prompting and batch prompting using CoT prompts. The language models are GPT-3 (text-davinci-003), GPT-3.5 (gpt-3.5-turbo), and GPT-4. Batch prompting applies well to different LLMs with good performance.

Table 3: Accuracy of various batching methods on five representative datasets. Similarity- or diversity-based methods do not achieve performance gains.

Table 4: Accuracy of different reasoning methods with standard and batch prompting on WikiTQ (Pasupat and Liang, 2015), GSM8K, and SVAMP. Batch prompting applies well, showing similar or better performance.

Table 6: Batch prompting accuracy with different b (the number of samples in each batch) compared with standard prompting on ten datasets. All use Codex (code-davinci-002) as the LLM and Chain-of-Thought as the reasoning method.

Table 7: Batch prompting time per sample with different b (the number of samples in each batch) compared with standard prompting on ten datasets. All use Codex (code-davinci-002) as the LLM and Chain-of-Thought as the reasoning method.

Table 8: Accuracy on WikiTQ with various table input strategies and b (number of samples in each batch) using Binder (Cheng et al., 2022) to generate programs with Codex (code-davinci-002).

Table 9: Accuracy on GSM8K and SVAMP with varying b (number of samples in each batch) using Program-of-Thought (Chen et al., 2022) to generate programs with Codex (code-davinci-002).

StrategyQA Prompt
A[1]: Hamsters are prey animals. Prey are food for predators. Thus, hamsters provide food for some animals. So the answer is yes.
A[2]: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. So the answer is yes.
A[3]: Hydrogen has an atomic number of 1. 1 squared is 1. There are 5 Spice Girls. Thus, Hydrogen's atomic number squared is less than 5. So the answer is no.
A[4]: College commencement ceremonies can happen in December, May, and June. December is in the winter, so there can be frost. Thus, there could be frost at some commencements. So the answer is yes.
Q[3]: Can an Arvanite Greek understand some of the Albanian Declaration of Independence?
Q[4]: Can Burundians communicate with citizens of New Brunswick?
A[1]: The War in Vietnam was 6 months. The gestation period for a llama is 11 months, which is more than 6 months. Thus, a llama could not give birth twice during the War in Vietnam. So the answer is no.
A[2]: The density of a pear is about 0.6 g/cm3, which is less than water. Objects less dense than water float. Thus, a pear would float. So the answer is no.
A[3]: The Arvanite Greeks are a major Tosk-speaking group of southern Albania. Thus, they can understand some of the Albanian Declaration of Independence. So the answer is yes.
A[4]: French is one of the official languages of Burundi. Thus, Burundians can communicate with citizens of New Brunswick. So the answer is yes.
A[1]: The Chinese calendar has a number of symbols including monkeys, goats, and tigers. Tigers have four paws and balance themselves by walking on their toes. Thus, quadrupeds are represented on the Chinese calendar. So the answer is yes.
A[2]: Contestants must be at least 16 years of age to compete in the finals of the Eurovision Song Contest. Dafne Keen is 15 years old in 2020. Thus, Dafne Keen cannot win the Eurovision Song Contest finals in 2020. So the answer is no.
A[3]: Students in the eleventh grade are typically 16-17 years of age. To serve as president, one must be at least 35 years old. Thus, a student in eleventh grade would be unable to run for president of the United States. So the answer is yes.

Table 11: StrategyQA Prompt.

GSM8K Prompt
Q[1]: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
Q[2]: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Q[3]: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
Q[4]: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
A[1]: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.
A[2]: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.
A[3]: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.
A[4]: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.
Q[1]: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
Q[2]: There were nine computers in the server room. Five more computers were installed each day, from Monday to Thursday. How many computers are now in the server room?
Q[3]: Michael had 58 golf balls. On Tuesday, he lost 23 golf balls. On Wednesday, he lost 2 more. How many golf balls did he have at the end of Wednesday?
Q[4]: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A[1]: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9.