Empirical Study of Zero-Shot NER with ChatGPT

Large language models (LLMs) have exhibited powerful capabilities in various natural language processing tasks. This work explores LLM performance on zero-shot information extraction, focusing on ChatGPT and the named entity recognition (NER) task. Inspired by the remarkable reasoning capability of LLMs on symbolic and arithmetic reasoning, we adapt prevalent reasoning methods to NER and propose reasoning strategies tailored for it. First, we explore a decomposed question-answering paradigm that breaks the NER task into simpler subproblems by label. Second, we propose syntactic augmentation to stimulate the model's intermediate thinking in two ways: syntactic prompting, which encourages the model to analyze the syntactic structure itself, and tool augmentation, which provides the model with syntactic information generated by a parsing tool. In addition, we adapt self-consistency to NER through a two-stage majority voting strategy, which first votes for the most consistent mentions and then for the most consistent types. The proposed methods achieve remarkable improvements for zero-shot NER across seven benchmarks, including Chinese and English datasets, in both domain-specific and general-domain scenarios. We also present a comprehensive analysis of the error types with suggestions for optimization directions, and verify the effectiveness of the proposed methods in the few-shot setting and on other LLMs.

Information extraction (IE) is a fundamental topic in NLP, which aims to extract structured information from unstructured text, including tasks such as named entity recognition (NER) (Yu et al., 2020), relation extraction (RE) (Baldini Soares et al., 2019), and event extraction (EE) (Chen et al., 2015). Evaluating ChatGPT's performance on IE is important for understanding its capabilities in structured prediction and language understanding (Li et al., 2023; Wei et al., 2023).
With recent techniques for eliciting complex multi-step reasoning (Wei et al., 2022; Wang et al., 2022b), LLMs have shown remarkable zero-shot reasoning ability in arithmetic and symbolic reasoning (Kojima et al., 2022). However, the reasoning ability of LLMs on IE remains largely unexplored. To mitigate this gap, we present a systematic empirical study of the reasoning capability of LLMs on IE, focusing on ChatGPT and the zero-shot NER task. By adapting prevalent reasoning techniques (Zhou et al., 2022; Wei et al., 2022; Wang et al., 2022b) to NER, we propose the following strategies to stimulate the reasoning potential of LLMs on NER: • We break down the NER task into a series of simpler subproblems by label and perform a decomposed question-answering (Decomposed-QA) paradigm, in which the model extracts entities of only one label at a time.
• We propose syntactic augmentation in two forms: syntactic prompting, which encourages the model to first analyze the syntactic structure of the input text itself and then recognize the named entities based on that structure; and tool augmentation, which provides the model with syntactic information generated by a parsing tool.
• We tailor self-consistency (SC) (Wang et al., 2022b) to NER with a two-stage majority voting strategy: after sampling multiple responses from the model, we first vote for the most consistent mentions, then for the most consistent types.
The main contributions of this paper include: • We present a systematic empirical investigation of zero-shot NER with LLMs, with a specific emphasis on ChatGPT as one of the most robust LLMs available.
• We adapt prevalent reasoning methods to NER and propose four strategies tailored for NER: decomposed-QA, syntactic prompting, tool augmentation, and two-stage majority voting.
• We evaluate our strategies across seven benchmarks. Experimental results reveal that the proposed strategies significantly facilitate zero-shot NER on both domain-specific out-of-distribution and general-domain datasets, covering Chinese and English scenarios.
2 Related Work

Reasoning with LLM
LLMs have shown remarkable zero-shot reasoning ability when explicitly encouraged to generate intermediate rationales for solving a problem. On the one hand, recent works in both the few-shot (Wei et al., 2022; Zhang et al., 2022; Wang et al., 2022a) and zero-shot (Kojima et al., 2022) settings elicit chain-of-thought (CoT) from LLMs and derive the answer step by step. On the other hand, problem decomposition, like least-to-most prompting (Zhou et al., 2022), reduces a complex problem to sub-problems and then solves these sub-problems sequentially; the SC strategy (Wang et al., 2022b) generates a diverse set of answers by sampling from the LLM and then marginalizes over the sampled answers to determine the optimal one. In this work, we focus on investigating the zero-shot reasoning ability of LLMs on the NER task.

LLM on IE
A few works study the performance of the powerful LLM ChatGPT on IE (Li et al., 2023; Ma et al., 2023; Laskar et al., 2023).

Method
Adapting the prevalent reasoning techniques (Zhou et al., 2022; Wei et al., 2022; Wang et al., 2022b) to NER, we propose four strategies to stimulate the reasoning capabilities of LLMs on NER. Examples of the proposed methods are shown in Fig. 1.

Decomposed-QA
Inspired by least-to-most prompting (Zhou et al., 2022), we improve zero-shot NER by decomposing the task into a set of simpler questions. Recognizing entities of all labels at once may be too challenging for ChatGPT (as in the vanilla zero-shot method shown in (a) of Fig. 1), especially when the label set is large or the data comes from a specific out-of-distribution domain. This motivates us to break down the NER task by labels. Given an input sentence, the whole process of recognizing entities is a multi-turn dialogue: each time, ChatGPT is asked to recognize entities of a single label.
After ChatGPT responds to the current question, we proceed to the question for the next label, incorporating all previous questions and answers as dialogue context. Once the questions for all labels have been addressed, we conclude the conversation.
We name this paradigm Decomposed-QA. An example is shown in (b) of Fig. 1. We obtain the label order used in the multi-turn dialogue by asking ChatGPT: for each dataset, we provide the task requirement and the label set to ChatGPT, then ask it to give a reasonable label order based on its understanding of the labels. For the domain-specific datasets, PowerPlantFlat and PowerPlantNested, which will be introduced in Section 4.1, we also use a manual label order provided by domain experts. The label orders are shown in Appendix G.
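As an illustration only, the multi-turn loop could be sketched as below; `ask` is a placeholder for a real chat-completion call (e.g. to gpt-3.5-turbo), and `fake_llm` is a hypothetical stub used here so the sketch is self-contained:

```python
from typing import Callable, Dict, List

def decomposed_qa(text: str, labels: List[str],
                  ask: Callable[[List[dict]], str]) -> Dict[str, str]:
    """Query one entity label at a time, carrying the full dialogue as context."""
    messages = [{"role": "system",
                 "content": f"Given entity label set: {labels}. "
                            f"Recognize the named entities in the text.\nText: {text}"}]
    answers = {}
    for label in labels:  # label order matters; see Appendix G
        messages.append({"role": "user",
                         "content": f"What are the named entities labeled as "
                                    f"'{label}' in the text?"})
        reply = ask(messages)  # one model call per label, full history included
        messages.append({"role": "assistant", "content": reply})
        answers[label] = reply
    return answers

# Hypothetical stub standing in for the chat model.
def fake_llm(messages):
    return "['Tony Blair']" if "'Person'" in messages[-1]["content"] else "[]"

result = decomposed_qa("Could Tony Blair be in line for a gold medal?",
                       ["Person", "Location"], fake_llm)
```

In the real setting each `ask` call would hit the chat API with the accumulated `messages`, so later labels see the answers for earlier ones.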

Syntactic Augmentation
Aiming to guide the model to think step by step while extracting information, we encourage ChatGPT to first grasp the syntactic structure of the input text and then leverage it to extract the relevant information. Five kinds of syntactic information are utilized: word segmentation, noun phrases, part-of-speech (POS) tags, constituency trees, and dependency trees. Word segmentation is used only for Chinese. We propose the following two ways of syntactic augmentation.
Syntactic Prompting. We encourage the model to analyze the syntactic structure itself by inserting a syntactic reasoning hint into the input instruction, as shown in (c) of Fig. 1. We explore two positions for the syntactic reasoning hint, i.e., at the back or the front of the instruction, as shown in Fig. 2.

Tool Augmentation.
We first obtain the syntactic information of the input text via a parsing tool; then, we feed the input text together with the syntactic information to ChatGPT, as shown in (d) of Fig. 1. We do not apply noun phrases in tool augmentation, since we did not find a parsing tool that reliably extracts noun phrases.
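A minimal sketch of prompt construction for tool augmentation, assuming the parser output (here, POS tags) has already been obtained; the tags are hard-coded below, but in practice they would come from a parsing tool such as spaCy or HanLP:

```python
def augment_with_syntax(text, pos_tags, label="Person"):
    """Prepend parser output (POS tags here) to the NER question."""
    tagged = " ".join(f"{tok}/{tag}" for tok, tag in pos_tags)
    return (f"Text: {text}\n"
            f"Part-of-Speech tags: {tagged}\n"
            f"Question: What are the named entities labeled as '{label}' in the text?")

# Hard-coded parser output so the sketch is self-contained.
pos = [("Tony", "PROPN"), ("Blair", "PROPN"), ("smiled", "VERB")]
prompt = augment_with_syntax("Tony Blair smiled", pos)
```

The same pattern applies to the other kinds of syntactic information (constituency or dependency trees would simply be serialized into the second line).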
We further explore the combination of tool augmentation and syntactic prompting. To enhance the utilization of the syntactic information from the parsing tool, we insert a syntactic reasoning hint. An example is shown in (e) of Fig. 1.

Self-Consistency with Two-Stage Majority Voting
Harnessing the power of SC (Wang et al., 2022b), we sample multiple responses from the model and select the most acknowledged answers as the final prediction. We design a two-stage majority voting for NER, as shown in (f) of Fig. 1. At stage one, for each candidate mention appearing in the responses, we keep it as an entity if it appears in more than half of the responses; otherwise, we discard it. At stage two, for each mention kept in stage one, we choose the entity label predicted by the majority of responses as the final label.
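The two stages above can be sketched directly, representing each sampled response as a mention-to-label mapping (the response format is an assumption for illustration):

```python
from collections import Counter

def two_stage_vote(responses):
    """responses: list of {mention: label} dicts, one per sampled response."""
    n = len(responses)
    # Stage 1: keep mentions appearing in more than half of the responses.
    mention_counts = Counter(m for r in responses for m in r)
    kept = [m for m, c in mention_counts.items() if c > n / 2]
    # Stage 2: for each kept mention, take the majority label.
    final = {}
    for m in kept:
        labels = Counter(r[m] for r in responses if m in r)
        final[m] = labels.most_common(1)[0][0]
    return final

responses = [
    {"Tony Blair": "Person", "London": "Location"},
    {"Tony Blair": "Person"},
    {"Tony Blair": "Organization", "London": "Location"},
    {"Paris": "Location"},
    {"Tony Blair": "Person", "London": "Facility"},
]
# "Paris" appears in only 1 of 5 responses, so stage one discards it;
# the kept mentions then receive their majority labels in stage two.
```

Voting over mentions first means a mention with inconsistent labels can still survive, with the label disagreement resolved separately at stage two.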
We explore two levels of SC for decomposed-QA: question-level and sample-level. For question-level SC, we sample multiple responses for the current question and conduct majority voting; we then fill the voted answer into the dialogue context for all subsequent questions. For sample-level SC, we run the whole dialogue multiple times independently, obtain the answer of each run, and conduct majority voting on these answers.

Setup
Datasets. We evaluate ChatGPT on both domain-specific and general-domain datasets. For the domain-specific setting, we present two Chinese NER datasets from the electric power domain, PowerPlantFlat (PPF) and PowerPlantNested (PPN). The two datasets are collected from technical reports produced during nuclear power plant operation and maintenance. PowerPlantFlat contains only flat cases, while PowerPlantNested contains nested entities. The two datasets come from a vertical industrial domain and thus serve as out-of-distribution data for ChatGPT. The statistics of the two datasets are shown in Appendix A. For general-domain evaluation, we use commonly adopted benchmarks, including two English datasets, ACE05 and ACE04, and three Chinese datasets, OntoNotes 4 (Onto.4), MSRA (Zhang et al., 2006), and Weibo NER (Peng and Dredze, 2015). For evaluation on more datasets, please refer to Appendix E.
Model. We mainly evaluate GPT-3.5 (gpt-3.5-turbo) via the official API. For Decomposed-QA, we maintain a dialogue for each test sample. For the vanilla setting, we generate the response separately for each test sample.
We also evaluate GPT-3 (text-davinci-003) (Ouyang et al., 2022) and Llama2 (Touvron et al., 2023) to verify the effectiveness of the proposed methods on other LLMs. We use the 13B chat model of Llama2. The results of these two LLMs are in Section 4.5.
Self-consistency. We set the temperature to 0.7 for settings with SC and to 0 for settings without it. For cost saving, we conduct majority voting over 5 responses in our main experiments. We first conduct both question-level and sample-level consistency on each dataset; we then choose the better-performing level for the remaining experiments on the corresponding dataset.
Data sampling. For syntactic augmentation, we evaluate on the entire test sets of the seven datasets. For SC and combinations of techniques, for cost saving, we evaluate on a subset of datasets and on randomly sampled subsets of test sets: we evaluate on the entire test sets of the two domain-specific datasets, PowerPlantFlat and PowerPlantNested, and on two general-domain datasets, Ontonotes 4 and ACE05, by randomly sampling 300 samples from the test set three times and reporting the average results.
SOTA of fully-supervised methods. For PowerPlantFlat and PowerPlantNested, we use GlobalPointer (Su et al., 2022), since it performs well on both flat and nested cases. For the other benchmarks, we refer to the corresponding papers: Weibo (Wang et al., 2021), MSRA (Li et al., 2020), Ontonotes 4 (Li et al., 2020), ACE05 (Zhong and Chen, 2021), and ACE04 (Zhong and Chen, 2021).

Overall Performance
Table 1 summarizes the performances of decomposed-QA and syntactic augmentation.
For the two domain-specific datasets, we use the manual label orders, as they show better performance in preliminary experiments. For cost saving, we explore SC and combinations of reasoning techniques on selected datasets and sampled test sets, as detailed in Section 4.3.

Effect of Decomposed-QA
From Table 1, we make the following observations: (1) Compared with the vanilla method, decomposed-QA achieves significant improvements across all benchmarks, including both Chinese and English scenarios and both domain-specific and general-domain scenarios. This demonstrates that decomposing by labels makes the NER task much more manageable for ChatGPT.
(2) Decomposed-QA exhibits more significant improvements on domain-specific datasets (an average 9.22% F1 gain) than on general-domain datasets (an average 3.82% F1 gain). This is presumably because out-of-distribution data are more challenging for ChatGPT, and decomposing gives ChatGPT a better understanding of such data.
(3) We also explore the effect of the reasoning techniques under the vanilla setting; the results are in Table 9 of Appendix C. We find that the vanilla setting fails to stimulate the potential of the reasoning techniques, whereas decomposed-QA stimulates the potential of syntactic augmentation.

Effect of Syntactic Augmentation
As shown in Table 1, we draw the following conclusions: (1) Syntactic prompting alone brings limited benefits. This is presumably because conducting syntactic analysis without any other augmentation is challenging for ChatGPT.
(2) Tool augmentation exhibits consistent improvements across six datasets, showing that syntactic information helps ChatGPT better understand the input text.
(3) Tool augmentation achieves larger improvements on Chinese than on English datasets. This may be because Chinese is harder than English for ChatGPT to handle, and syntactic information provides a clue for understanding the Chinese input better. (4) Different kinds of syntactic information exhibit varied performance. On Chinese datasets, word segmentation performs best; on English datasets, POS tags boost performance the most. This is presumably because simpler syntactic information is easier for ChatGPT to understand. Complex syntactic information, such as the dependency tree, though informative, can be hard to understand and thereby exhibits unstable performance.

Effect of Self-Consistency and Combinations of Reasoning Techniques
From the table and the figure, we have the following observations and conclusions: (1) SC shows consistent improvements on almost all methods. As long as syntactic information is involved, SC can always boost performance. This may be because syntactic information is helpful but hard to understand or analyze; it thus gives ChatGPT the potential to perform better, but also a higher chance of making mistakes. SC can filter out errors, thereby leveraging the advantages of syntactic information while eliminating its disadvantages. (2) Syntactic prompting fails to boost tool augmentation and can even hurt performance. However, when equipped with SC, syntactic prompting improves tool augmentation. This may be due to the complexity of the information provided by the combination of tool augmentation and syntactic prompting: the complex information leads the model to think and explore more, which is naturally accompanied by more errors. This makes SC an effective means of filtering out errors here. (3) SC improves performance more when the syntactic reasoning hint is placed at the back rather than the front. This is presumably because the closer the reasoning hint is to the answer, the more it stimulates the model's thinking; placing the hint at the back thus encourages the model to generate more diverse answers, which provides a better search space for majority voting.
We explore the effect of increasing the number of sampled responses in SC, as shown in Fig. 4. We sample up to 30 responses for cost saving. As seen in the figure, sampling more responses improves performance. We conjecture that combining diverse kinds of syntactic information may further benefit SC on NER.
Boundary. Contain gold: predicted mentions containing gold mentions. Contained by gold: predicted mentions contained by gold mentions. Overlap with gold: predicted mentions not in the two situations above but overlapping with gold mentions.
Completely-O: predicted mentions that do not stand in any of the three boundary relations above with any gold mention.
OOD mentions: predicted mentions that do not appear in the input text.
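Under the simplifying assumption that a predicted mention is matched at its first occurrence in the text, these categories could be computed for a single prediction as follows (span conventions and helper name are illustrative, not the paper's implementation):

```python
def classify_prediction(pred_mention, gold_spans, text):
    """Classify one predicted mention against gold (start, end) character spans."""
    s = text.find(pred_mention)
    if s == -1:
        return "OOD mention"          # mention string absent from the input text
    e = s + len(pred_mention)
    for gs, ge in gold_spans:
        if (s, e) == (gs, ge):
            return "correct boundary"
        if s <= gs and ge <= e:
            return "contain gold"     # prediction strictly contains a gold span
        if gs <= s and e <= ge:
            return "contained by gold"
        if s < ge and gs < e:
            return "overlap with gold"
    return "completely-O"             # no boundary relation with any gold span

text = "the Bank of England in London"
gold_spans = [(4, 19), (23, 29)]      # "Bank of England", "London"
```

Counting these labels over all predictions yields the error distribution reported in the analysis.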
As shown in Fig. 6, the majority of errors are of types completely-O and wrong type, which together account for over 80% of all errors. The former may be due to incomplete annotation or to ChatGPT guessing entities based on its prior common knowledge; the latter may be due to an inadequate understanding of the entity types. As seen in Table 3, decomposed-QA reduces the total number of errors by 9.4%, and the combination of tool augmentation, syntactic prompting, and SC (TS-SC) reduces it by 24.1%, showing remarkable capability in error correction.

Case Study of Error Correction and Error Increase
As seen in Table 3, TS-SC reduces errors mainly of types contain gold and completely-O, and increases errors mainly of types contained by gold and omitted mentions. Thus, we conduct a case study on these four types, shown in Fig. 5. TS-SC corrects errors of contain gold and completely-O presumably by providing syntactic information that helps the model better understand the input text. Meanwhile, TS-SC increases errors of contained by gold and omitted mentions, presumably because of misguiding syntactic information and an inadequate understanding of entity types, respectively. For the former, providing more accurate and comprehensive syntactic information might be a solution; for the latter, providing type information might be a direction for optimization.
Table 4: Results under the few-shot setting, where the number of shots is the number of texts. We randomly sample three sets of demonstrations and report the averages. Results for Ontonotes 4 are averaged over three sets of 300 samples randomly drawn from the test set. We report F1 values. Numbers in parentheses are the standard deviations. Numbers in bold are the best results. Our methods also achieve significant improvements in few-shot scenarios.

More analysis
Few-shot setting. We evaluate the proposed syntactic augmentation under the few-shot setting. Han et al. (2023) investigate standard CoT on NER by generating intermediate rationales with ChatGPT. We take a different perspective: we encourage the model to explore syntactic information as its intermediate thinking steps. Detailed adaptations of our methods to the few-shot setting are explained in Appendix D. For decomposed-QA and SC, we leave the few-shot setting to future work due to the cost budget. We compare our methods with the vanilla method and standard CoT, using ChatGPT to generate the rationales for standard CoT, following Han et al. (2023). Here, we use one general-domain dataset, Ontonotes 4, and one domain-specific dataset, PowerPlantFlat, for demonstration. The results are shown in Table 4, where word segmentation is used as the syntactic information. The results for the other kinds of syntactic information are in Appendix D.
As observed in Table 4, standard CoT does not bring improvements and even hurts performance. This is presumably because standard CoT is very sensitive to the constructed rationales, as also mentioned by Han et al. (2023). In contrast, our strategies achieve significant improvements, showing that the proposed methods are effective not only in the zero-shot scenario but also in the few-shot setting.
Other LLMs. We also evaluate our methods on GPT-3 (text-davinci-003) (Ouyang et al., 2022) and Llama2 (Touvron et al., 2023). Since Llama2 still has limited support for Chinese, we evaluate on two English datasets: one general-domain dataset, ACE05, and one biomedical dataset, BC5CDR (Li et al., 2016). The results are shown in Table 5, where the dependency tree is used as the syntactic information. The complete results are in Appendix F, and the main results on BC5CDR are in Appendix E. Table 5 shows that our methods exhibit consistent improvements across different LLMs, including the closed-source ChatGPT model series and the typical open-source model Llama2.

Conclusion
We present an empirical study of zero-shot NER with ChatGPT, with four proposed strategies to stimulate the reasoning potential of ChatGPT on NER. Inspired by the powerful reasoning capabilities of LLMs on logical and arithmetic reasoning tasks, the proposed strategies involve task decomposition, syntactic augmentation, and tailored SC. We verify the effectiveness of our methods on Chinese and English scenarios, and on both domain-specific and general-domain datasets. We provide an analysis of the error types with suggested solutions. Besides, we verify the effectiveness of the proposed methods in the few-shot setting and on other LLMs.

Limitations
For cost saving, we focus on investigating each kind of syntactic information individually and have not explored combinations of different kinds of syntactic information. For the same reason, we have not investigated manual label orders on the general-domain datasets. We leave both to future work.

C Performance Under Vanilla Setting
We also investigate the effect of our proposed reasoning techniques under the standard setting. The results are shown in Table 9. From the table, we can conclude that the potential of the syntactic information cannot be fully exploited under the standard setting; on the contrary, the proposed decomposed-QA paradigm effectively utilizes the syntactic information, as shown in Table 2 and Figure 3. Under the standard setting, the reasoning techniques bring limited benefits on general-domain datasets, sometimes even hurting performance. However, these techniques exhibit improvements on domain-specific, i.e., out-of-distribution, datasets. This is presumably because out-of-distribution data are much more challenging than general-domain data for ChatGPT, and the reasoning techniques lead the model to a better understanding of such data.

D Syntactic Augmentation Under Few-shot Setting

The following are the adaptations of our proposed syntactic augmentation strategies to the few-shot setting.
(1) Syntactic prompting: for the test sample, we ask the model to first perform syntactic analysis and then recognize entities; for the demonstrations, we use parsing tools to generate the intermediate syntactic parsing results. (2) Tool augmentation: we provide both the text and the syntactic information for the demonstrations and the test sample.
Table 10 shows the experiment results under the 3-shot setting.
E Evaluation on Additional Datasets

The results are shown in Table 11. We find that the proposed reasoning techniques cannot guarantee performance improvements on CoNLL-2003 and WNUT-17 and can even hurt performance. This is presumably because the label logic of these two datasets is not suitable for decomposition, and the syntactic information generated for them is noisy.
Meanwhile, we conjecture that this is also because CoNLL-2003 and WNUT-17 contain larger numbers of shorter texts, on which the reasoning techniques struggle to leverage their advantages. However, the proposed methods achieve significant improvements on the biomedical-domain datasets BC5CDR, BioNLP11, and CRAFT. This demonstrates that the proposed methods can also improve zero-shot NER in other challenging domains besides the electric power domain. Counting the five datasets evaluated in Table 11, we evaluate twelve benchmarks in total and achieve remarkable improvements on ten of them.

F Evaluation on Other LLMs
The complete results on GPT-3 and Llama2 are shown in Table 12. These results show that our methods exhibit consistent improvements across different LLMs. On the smallest LLM evaluated, Llama2 13B, our proposed strategies still achieve remarkable performance improvements, with 19.72% and 17.51% F1 improvements on ACE05 and BC5CDR, respectively. This reveals that our methods apply widely across LLM sizes, which is beneficial for low-resource scenarios where only smaller LLMs are affordable.

G Label Order
Table 13 displays the label orders used in our main experiments and the corresponding results under basic decomposed-QA. On the power plant datasets, the manual label orders provided by domain experts achieve significantly better results. This demonstrates that, when dealing with domain-specific datasets with ChatGPT, one may turn to domain knowledge to boost performance.
Table 14 displays the label orders of additional datasets.
Table 15 shows the instructions for asking ChatGPT to provide the label orders of the PowerPlantFlat and ACE05 datasets.

H Prompts
We show all of our prompts with Ontonotes 4 and ACE05 as examples.

Figure 1 :
Figure 1: Examples of proposed methods for zero-shot NER with ChatGPT. (a) Vanilla zero-shot method. (b) Basic decomposed-QA, where the NER task is broken down into simpler subproblems. (c) Decomposed-QA with syntactic prompting. Texts in green are the proposed syntactic reasoning hint. (d) Decomposed-QA with tool augmentation. Texts in orange are the content of syntactic information. (e) Decomposed-QA with tool augmentation and syntactic prompting. (f) SC with two-stage majority voting, where stage one votes for the mentions and stage two votes for the types. We use part-of-speech tags as the example syntactic information in this figure. The detailed prompts are shown in Appendix H.

Figure 2 :
Figure 2: Two positions of syntactic reasoning hint.

Figure 3 :
Figure 3: Performance of combinations of reasoning techniques. For methods involving syntactic augmentation, we plot the average results over all kinds of syntactic information. The vertical lines at the top of some bars represent the performance range over all kinds of syntactic information. With SC of two-stage majority voting, the combinations of reasoning techniques further improve performance.

Figure 4 :
Figure 4: Increasing sampled responses generally improves performance under SC with two-stage majority voting.

Figure 5 :
Figure 5: Case study of error correction and error increase with the proposed methods. We translate the original Chinese text into English for readability. The upper two cases are errors corrected; the lower two are errors increased. Texts in blue are the entities involved in the error cases. Our method is effective at error correction; with the suggested optimization strategies, the increased errors might be eliminated.

Figure 7 :
Figure 7: Percentage of different error types on PowerPlantFlat under the vanilla setting.
Wei et al. (2023) propose a two-stage chatting paradigm for IE: at stage one, it asks ChatGPT to recognize the types of elements; at stage two, it asks ChatGPT to extract the mentions corresponding to each type recognized at stage one.

Table 1 :
Overall performance. We report F1 values. Vanilla denotes the vanilla zero-shot method without any techniques; Syn. denotes syntactic prompting; Tool. denotes tool augmentation. We use the same abbreviations in the rest of this paper where necessary. Syntactic augmentation is always conducted under the decomposed-QA setting. Numbers in bold are the best results in the corresponding categories; numbers underlined are the best results among all methods in the zero-shot scenario. The proposed decomposed-QA and syntactic augmentation achieve significant improvements for zero-shot NER on both Chinese and English datasets and in both domain-specific and general-domain scenarios.

Table 2 :
Performance of SC and combinations of reasoning techniques. We report F1 values. Numbers in parentheses are the standard deviations. Numbers in bold are the best results in the corresponding categories; numbers underlined are the best results among all methods in the zero-shot scenario. SC with two-stage majority voting and the combinations of reasoning techniques bring further improvements.
Table 2 summarizes the performance of SC and the combinations of reasoning techniques. We visualize the results on PowerPlantFlat and Ontonotes 4 in Fig. 3 for better analysis.

Table 3 :
Numbers of error types on Ontonotes 4. "QA" denotes decomposed-QA; "TS-SC" denotes the combination of tool augmentation, syntactic prompting, and SC. Numbers in bold denote the best results, i.e., the fewest errors. The proposed methods significantly reduce the total number of errors.

Table 5 :
Performance on GPT-3 and the Llama2 13B chat model. Results are averaged over three sets of 300 samples randomly drawn from the test set. We report F1 values. Our proposed strategies show consistent improvements on various LLMs.

Table 6 :
Statistics of the PowerPlant datasets. Table 6 shows the overall statistics, and Table 8 displays the class-wise statistics.

Table 7 :
Numbers of different error types on PowerPlantFlat. "QA" refers to decomposed-QA; "TS-SC" refers to the combination of tool augmentation, syntactic prompting, and SC. Numbers in bold denote the best results on PowerPlantFlat, i.e., the fewest errors.

Table 9 :
Performance of the reasoning techniques under the vanilla setting (without decomposition). In this table, "vanilla" refers specifically to the zero-shot method without any techniques. We report F1 values on the entire test sets. We omit SC on MSRA and Ontonotes 4 for cost saving.

Table 10 :
Performance of syntactic augmentation under the 3-shot setting. We randomly sample three sets of demonstrations and report the means and standard deviations of F1 values. Numbers in parentheses are standard deviations. Numbers in bold are the best results in each category. The proposed syntactic augmentation exhibits significant improvements in the few-shot setting.

Table 11 :
Performance on additional datasets. Results are averaged over three sets of 300 samples randomly drawn from the test set. We report the means and standard deviations of F1 values. Numbers in parentheses are standard deviations. Numbers in bold are the best results in each category.