Prompting Language Models for Linguistic Structure

Although pretrained language models (PLMs) can be prompted to perform a wide range of language tasks, it remains an open question how much this ability comes from generalizable linguistic understanding versus surface-level lexical patterns. To test this, we present a structured prompting approach for linguistic structured prediction tasks, allowing us to perform zero- and few-shot sequence tagging with autoregressive PLMs. We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking, demonstrating strong few-shot performance in all cases. We also find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels. These findings indicate that the in-context learning ability and linguistic knowledge of PLMs generalize beyond memorization of their training data.


Introduction
The rapid increase in the scale of pretrained language models (PLMs) has led to a new paradigm of NLP modeling: in-context learning, or prompting (e.g., Brown et al., 2020; Raffel et al., 2020). In this setting, the model is used to perform a task directly via the predictions of the LM head without additional finetuning on the target task, often with a few demonstrations of the desired behavior provided within the input. This has led to impressive few-shot performance with PLMs on a wide variety of tasks, ranging from classification to summarization and generation (Liu et al., 2021a). Due to their broad success on tasks requiring language understanding, we hypothesize that these models also contain significant knowledge about linguistics. However, we are not aware of existing prompting methods that can directly test this hypothesis on autoregressive PLMs. Behavioral analysis of PLMs (Belinkov et al., 2020) uses similar methods to prompting to measure knowledge stored in language models (Gulordava et al., 2018; Petroni et al., 2019), but this technique is difficult to generalize to tasks that predict more complex structures. Additionally, current approaches for applying PLMs to linguistic structure prediction tasks finetune on the downstream task (e.g., Ma et al., 2022), which confounds measuring underlying model knowledge.
Figure caption (fragment): Each predicted label is appended to the context along with the next word to iteratively tag the full sentence.
We propose a new approach, structured prompting, that prompts autoregressive PLMs to probe for word- and span-level linguistics framed as sequence tagging tasks (Section 2). At timestep t, a label for the t-th word in the sequence is decoded from the LM; the model prediction is then fed back into the model along with the next word to progress to timestep t + 1. We evaluate our approach on three sequence tagging tasks (POS tagging, sentence chunking, and NER). Our experiments show that PLMs can perform effective few-shot sequence tagging in the structured prompting setup, and that performance increases with demonstration set size and model size, consistent with other prompting methods (Section 4).
We further analyze structured prompting by examining how the model generalizes to various representations for labels (Section 5) as well as by analyzing the presence of task data in the pretraining corpus and how this affects model performance (Section 6). These analyses find that we are able to uncover linguistic information from the model in a wide range of settings, indicating that PLMs contain this knowledge in a general manner beyond memorization of the task from pretraining data. Interestingly, we find that while PLMs perform best with meaningful labels (such as original task labels or an English descriptor of the class), the model is also able to learn in context from arbitrary labels.
The contributions of this work are therefore threefold: (1) we introduce a new prompting paradigm, structured prompting, that probes PLMs for sequence knowledge without further training, (2) we find that this approach recovers linguistic structure from PLMs in a few-shot manner, and (3) our analysis of structured prompting provides insights into what aspects of linguistic generalization allow PLMs to learn in context.

Structured Prompting of Pretrained Language Models
We propose a sequential method for performing sequence tagging with PLMs via in-context learning, which we refer to as structured prompting (Figure 1). The model is provided k (context, tagged sequence) pairs as the task demonstration and the full example sentence to be labeled. The LM head is used to iteratively tag the words in the example with constrained decoding over a fixed set of labels. More specifically, given a set of labels L and an input sequence c containing k demonstration pairs as well as the full text of the example sentence $S = s_0, \ldots, s_n$, at each time step t the language model M encodes $[c; s_t]$ and labels $s_t$ with $\hat{\ell}_t = \arg\max_{\ell \in L} P_M(\ell \mid c, s_t)$. The input sequence is then updated by appending the current word $s_t$ and the predicted label $\hat{\ell}_t$ to the end of c. Multi-token labels are scored with the average log-likelihood over all tokens, $P_M(\ell \mid c) = \frac{1}{|\ell|} \sum_{i=0}^{|\ell|} P_M(y_i \mid c, y_0, \ldots, y_{i-1})$, where $y_j$ is the jth subword token in $\ell$.
This approach to in-context learning tags an entire sequence with a single pass over the context (with the use of model state caching). It also allows the model to condition on past predictions while labeling the current word. As we demonstrate in Section 4, these features allow us to apply large autoregressive language models to a broad class of core NLP tasks in a few-shot manner.
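The decoding loop described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the `token_logprobs` callback, the "Context:"/"Tagged:" strings, the `_` delimiter, and the toy scorer are all assumptions made for illustration, and a real setup would replace the toy scorer with per-token log-probabilities from an autoregressive LM (reusing cached model state between steps).

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer type: given the running context and a candidate label,
# return the log-probability of each subword token of that label.
TokenLogProbs = Callable[[str, str], Sequence[float]]


def label_score(context: str, label: str, token_logprobs: TokenLogProbs) -> float:
    """Average per-token log-likelihood of `label` given `context`."""
    logps = token_logprobs(context, label)
    return sum(logps) / len(logps)


def structured_prompt_tag(
    demos: str,
    words: List[str],
    labels: List[str],
    token_logprobs: TokenLogProbs,
    delimiter: str = "_",
) -> List[str]:
    """Tag `words` one at a time, feeding each prediction back into the context."""
    # The context holds the demonstrations and the full untagged sentence.
    context = f"{demos}Context: {' '.join(words)}\nTagged:"
    predictions = []
    for word in words:
        # Append the current word, then score every candidate label after it.
        context += f" {word}{delimiter}"
        best = max(labels, key=lambda lab: label_score(context, lab, token_logprobs))
        predictions.append(best)
        # Append the predicted label so later steps condition on it.
        context += best
    return predictions


def toy_logprobs(context: str, label: str) -> List[float]:
    """Stand-in for a real LM: favors DET after 'the', NOUN otherwise.

    Assumes the default "_" delimiter when recovering the current word.
    """
    current_word = context.rsplit(" ", 1)[-1].rstrip("_")
    preferred = "DET" if current_word.lower() == "the" else "NOUN"
    return [math.log(0.9 if label == preferred else 0.05)]


print(structured_prompt_tag("", ["the", "cat"], ["DET", "NOUN", "VERB"], toy_logprobs))
```

Restricting the `max` to the fixed label list plays the role of constrained decoding here: the model is never asked to generate free text, only to rank the candidate labels.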

Prompt Formatting
We use a lightweight prompt format with limited natural language guidance about the task provided to the model as shown in Figure 1; the letters "C" and "T" in the figure represent the inputs "Context" and "Tagged" respectively. For each task, we represent each tag with the token or sequence of tokens that correspond to the surface form of the label provided by the dataset.
In preliminary experiments, varying the prompt format generally had little effect on performance. Specifically, performance was stable across the choice of delimiter (out of the considered options in {/, :, −, _, =}) and other minor formatting differences. However, including the word in the "Tagged" sequence is important; on GPT-J, performance degrades by 84% on POS and 79% on NER when decoding the label sequence without repeating the word (i.e., "Tagged: DET NOUN...").
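As a rough illustration of the prompt format, the sketch below renders one demonstration with and without the word repeated in the "Tagged" sequence. The exact whitespace, the newline placement, and the default `_` delimiter are assumptions; only the "Context"/"Tagged" field names and the candidate delimiters come from the text.

```python
from typing import List, Tuple

Demo = List[Tuple[str, str]]  # one sentence as (word, label) pairs


def format_demo(pairs: Demo, delimiter: str = "_", repeat_word: bool = True) -> str:
    """Render one demonstration as a Context line and a Tagged line."""
    context = " ".join(word for word, _ in pairs)
    if repeat_word:
        # Default format: each word is repeated next to its label.
        tagged = " ".join(f"{word}{delimiter}{label}" for word, label in pairs)
    else:
        # Ablated format: the label sequence alone.
        tagged = " ".join(label for _, label in pairs)
    return f"Context: {context}\nTagged: {tagged}\n"


demo = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
print(format_demo(demo))
print(format_demo(demo, repeat_word=False))
```

Concatenating k such strings would give the demonstration prefix that precedes the sentence to be tagged.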

Sequence Tagging Tasks
We consider the following English tasks framed as sequence tagging problems in the evaluation of the proposed structured prompting method.
Part-of-Speech (POS) Tagging We evaluate POS tagging performance on English Universal Dependencies (UD) with the UPOS tagset (Nivre et al., 2020). Specifically, we use the treebank annotated on the GUM corpus (Zeldes, 2017).
Sentence Chunking Chunking, or shallow parsing, partitions the words in a sentence into various non-overlapping spans of syntactic meaning. We measure the performance of PLMs on sentence chunking using the CONLL2000 dataset from Sang and Buchholz (2000), which frames chunking as a BIO tagging task.
Named Entity Recognition (NER) We evaluate the ability of structured prompting to extract named entities from PLMs with NER. This is measured as a BIO tagging task on the CONLL2003 dataset (Sang and De Meulder, 2003).

GPT-3
We also perform structured prompting with the GPT-3 models (Brown et al., 2020) via the OpenAI API. We use the base GPT-Curie (∼6B parameters) and GPT-Davinci (∼175B parameters) models 1 on POS tagging. Due to the cost of running these models through the API, GPT-Davinci is evaluated with unconstrained greedy decoding, rather than the constrained decoding setup described in Section 2.
In preliminary experiments we also evaluate a number of OPT models. We find their performance was significantly worse and did not scale with model size (up to 66B parameters) on POS tagging and NER. We leave a more thorough examination of this behavior for future work.

Additional Experimental Details
For each experiment, we report the mean and standard error of performance across m runs. For each of these runs, k demonstrations are sampled from the training dataset at random, with the condition that the k demonstrations cover the label space of the task if possible. 2 We use k = 10 sentences as demonstrations and perform m = 5 runs per experiment unless otherwise stated. Each model is evaluated on 1000 examples randomly sampled from the task test set (see Appendix A.1 for a discussion on the effect this choice has on performance estimates). The evaluation subset is held fixed across all five runs, and the evaluation data and selection of demonstrations for each run are fixed across models for each task.
1 I.e., with no additional instruction finetuning.
2 For smaller ks this constraint is sometimes not met.
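The sampling and reporting procedure in this section could be sketched as follows. The rejection-sampling strategy and the `max_tries` cutoff are assumptions (the text only states that demonstrations cover the label space "if possible"), and the function names are hypothetical.

```python
import random
import statistics
from typing import List, Sequence, Set, Tuple

Sentence = List[Tuple[str, str]]  # (word, label) pairs


def sample_demos(
    train: Sequence[Sentence],
    k: int,
    label_space: Set[str],
    rng: random.Random,
    max_tries: int = 1000,
) -> List[Sentence]:
    """Draw k sentences, preferring a draw whose labels cover `label_space`."""
    sample: List[Sentence] = []
    for _ in range(max_tries):
        sample = rng.sample(list(train), k)
        seen = {label for sent in sample for _, label in sent}
        if label_space <= seen:
            return sample
    # No covering draw was found (likely for small k); return the last draw.
    return sample


def mean_and_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across runs (needs >= 2 scores)."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, stderr


train = [[("a", "X")], [("b", "Y")], [("c", "X")]]
print(sample_demos(train, 2, {"X", "Y"}, random.Random(0)))
print(mean_and_stderr([1.0, 2.0, 3.0]))
```

Seeding one `random.Random` per run is one way to keep the demonstration sets identical across models, as the setup requires.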
To obtain the tag sequence for each example, we greedily take the top-1 label (with the highest log likelihood) for each word. 3 For the span labeling tasks involving BIO tagging (chunking, NER), we enforce hard constraints to ensure a valid BIO tag sequence (e.g., I-X tags can only follow a previous B-X or I-X tag). Empirically, we find that enforcing BIO constraints makes little difference in the overall performance of the method; however, we use them as they ensure valid output sequences. Appendix A.2 compares model performance with and without BIO constraints.
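The BIO constraint described above can be expressed as a filter on the candidate labels at each step. This is a minimal sketch with a hypothetical helper name; it encodes only the rule stated in the text (an I-X tag must follow B-X or I-X).

```python
from typing import List, Optional


def allowed_labels(prev_label: Optional[str], labels: List[str]) -> List[str]:
    """Labels that keep the BIO sequence valid after `prev_label`.

    `prev_label` is None at the start of a sentence.
    """
    allowed = []
    for label in labels:
        if label.startswith("I-"):
            span_type = label[2:]
            # I-X may only continue a span opened by B-X or extended by I-X.
            if prev_label not in (f"B-{span_type}", f"I-{span_type}"):
                continue
        allowed.append(label)
    return allowed


ner_labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
print(allowed_labels(None, ner_labels))
print(allowed_labels("B-PER", ner_labels))
```

During decoding, the top-1 label would then be taken over `allowed_labels(previous_prediction, labels)` rather than over the full label set.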

Structured Prompting Results
We measure the performance of structured prompting on three sequence tagging tasks. The aims of this evaluation are to (1) validate that structured prompting follows prior prompting setups in terms of model and k-shot scaling trends and (2) investigate the extent to which the approach extracts these structures from the model. We then quantify the types of errors made with structured prompting.

Overall Results
Figure 2 presents the results of our primary structured prompting evaluation. We consider the performance of GPT-NeoX (Black et al., 2022) compared to task baselines: overall majority, in which each word is labeled with the most frequent tag in the training set, and per-word majority, where each word is labeled with the tag it most commonly appeared with in the training data (left panel). 4 All baselines are calculated on the full training set and so use more labeled data than the PLM; the per-word majority is a particularly strong baseline as words frequently occur with the same tag.
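For concreteness, the two baselines can be sketched as below. The fallback for words unseen in training (the overall majority tag) is an assumption, since the text does not say how such words are handled, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Sentence = List[Tuple[str, str]]  # (word, label) pairs


def fit_majority_baselines(train: Sequence[Sentence]) -> Tuple[str, Dict[str, str]]:
    """Return the overall majority tag and each word's most frequent tag."""
    overall: Counter = Counter()
    per_word: Dict[str, Counter] = defaultdict(Counter)
    for sentence in train:
        for word, label in sentence:
            overall[label] += 1
            per_word[word][label] += 1
    overall_tag = overall.most_common(1)[0][0]
    word_tag = {w: counts.most_common(1)[0][0] for w, counts in per_word.items()}
    return overall_tag, word_tag


def predict_per_word(words: List[str], overall_tag: str, word_tag: Dict[str, str]) -> List[str]:
    # Assumption: unseen words fall back to the overall majority tag.
    return [word_tag.get(word, overall_tag) for word in words]


train = [
    [("the", "DET"), ("dog", "NOUN")],
    [("the", "DET"), ("run", "VERB")],
    [("a", "DET")],
]
overall_tag, word_tag = fit_majority_baselines(train)
print(overall_tag, predict_per_word(["the", "dog", "zebra"], overall_tag, word_tag))
```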
Structured prompting performs effective few-shot sequence tagging We find that GPT-NeoX significantly outperforms each baseline on POS tagging and NER, and the model slightly underperforms the per-word majority baseline on sentence chunking by 4.2 points. Overall, the approach performs worse for the BIO span-labeling tasks than for word-level POS tagging. We hypothesize that this is because the former tasks are more complex, requiring the model to determine spans and more complicated linguistic knowledge.
Structured prompting scales with model and demonstration size We observe that the performance of structured prompting improves with scale across GPT-Neo models (center panel). Model performance also improves with additional demonstrations (right panel); both of these trends are consistent with prior prompting results (e.g., Black et al., 2022). However, how much the additional demonstrations help varies: NER improves more with larger sizes of k than POS and chunking, likely because labeled spans are more sparse in NER.
Notably, in the zero-shot case the model achieves around 17% accuracy on POS tagging when randomly predicting labels would yield 5.8%.
Structured prompting with GPT-3 Table 1 compares two GPT-3 models to the GPT-Neo series. GPT-Curie is estimated to have roughly 6B parameters (Gao, 2021); we therefore compare it to the similarly sized GPT-J model in a 5-shot setting. We find that GPT-Curie underperforms GPT-J by 12.7 points; both models also underperform the per-word majority baseline in this setting. We then evaluate the largest GPT-3 model, GPT-Davinci, on POS tagging with greedy unconstrained decoding of the entire output sequence. We find that Davinci performs reasonably well and scores similarly to Curie despite the more difficult decoding setting; many errors arise from format errors in the generated output for longer sentences. If we only evaluate examples prior to format errors, performance on that subset of the evaluation data is 72.85 ± 1.3 at k=5 and 78.04 ± 0.8 at k=10.

Error Analysis
Figure 3 presents an error analysis of structured prompting; complete analyses for other tasks are provided in Appendix A.3. We first break out performance across runs and evaluate how the choice of in-context examples affects performance (left panel). For POS tagging, the choice of demonstrations does make a difference, with some sets generally performing better than others across models and a performance gap of 4.8 accuracy points between the best and worst run on the 20B parameter model. NER exhibits similar results to POS; however, the performance of different demonstration sets for chunking is much more varied and inconsistent across models.
Next, we examine the types of errors PLMs make in structured prompting with aggregated confusion matrices across runs (center and right panel). For clarity, the diagonal (representing correct predictions) has been zeroed out and the matrices are normalized. Many of the mistakes made by the 20B parameter model on POS tagging are for syntactically similar roles, such as confusing proper nouns for nouns and labeling auxiliary verbs as verbs. However, for BIO tagging the models are not always well-calibrated: on NER the model most often mislabels "O" tokens, indicating that the model is overpredicting named entities.
Given that the choice of demonstrations affects PLM performance, another consideration is: how consistent are the error types across runs? To investigate this, we calculate the pairwise Spearman correlations between the confusion matrices of each run. For the 20B parameter model these correlations are very high, indicating the model makes similar types of error across runs: on average ρ = 0.77 for POS tagging, 0.83 for NER, and 0.88 for chunking; all pairwise correlations have p-values << 0.001. Additionally, the models seem to become more robust across demonstration sets at scale; confusion matrix correlations for the 2.7B model are lower, though still significant (ρ = 0.71, 0.64, 0.66 for POS, NER, and chunking respectively).
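The consistency measure above can be sketched in pure Python. Two assumptions are made here that the text does not confirm: the diagonal (correct predictions) is excluded before correlating, and each matrix is flattened row by row. The sketch computes only the correlation coefficient, not p-values, and it would divide by zero if a matrix had constant off-diagonal entries.

```python
from itertools import combinations
from typing import List, Sequence

Matrix = List[List[float]]


def off_diagonal(matrix: Matrix) -> List[float]:
    """Flatten a confusion matrix row by row, dropping correct predictions."""
    return [v for i, row in enumerate(matrix) for j, v in enumerate(row) if i != j]


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(matrices: Sequence[Matrix]) -> float:
    """Average Spearman correlation over all pairs of runs."""
    rhos = [spearman(off_diagonal(a), off_diagonal(b)) for a, b in combinations(matrices, 2)]
    return sum(rhos) / len(rhos)
```

With five runs per experiment, `mean_pairwise_spearman` would average over ten pairs of confusion matrices.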

When Does Structured Prompting Work?
We turn to an investigation of what factors in structured prompting surface linguistic structure from PLMs, using the behavior of GPT-NeoX on POS tagging and NER as a case study. Our goal is to test the extent to which the model can perform these tasks and what makes it fail, in order to better understand how it contains the linguistic information we are interested in. We find that (1) the model can in some cases generalize to labels not seen in the demonstration, and (2) the exact name of the label has a large effect on performance. Specifically, the model exhibits the ability to learn in context when classes are represented by arbitrary labels but will ignore label mappings in the demonstration that are contradictory to its prior task knowledge.

Effect of Seen Labels
In Section 4.1, we see that the model obtains above random chance accuracy on zero-shot POS tagging, suggesting that the model does not need to observe the label to associate it with the correct class. To analyze this we compare the performance of the model when the label is and is not seen in the demonstration, averaged across k-shot runs.
Model performance on unseen tags, and the gain in performance after observing the tag, varies greatly by label class (Figure 4). For some classes in POS tagging, such as ADJ and PUNCT, the model obtains around 50% accuracy without seeing the label. However, on AUX in POS tagging and MISC in NER unseen performance is close to 0%. Furthermore, while observing tags like LOC in NER greatly improves performance, other tags like ADJ and MISC improve much less when seen.

Effect of Label Form
We hypothesize that the behavior observed in Section 5.1 depends on how informative the label form is for the class. We therefore compare the model performance on three label sets (Figure 5): (1) the original task labels; (2) shuffled task labels, where the label surface forms are shuffled but underlying class correspondences to words are maintained; and (3) proxy labels, where the classes are represented by arbitrary tokens (here, consecutive integers ranging from 11 to 27 for POS and from 11 to 14 for NER).
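The shuffled and proxy label sets can be built as simple class-to-surface-form mappings. In this sketch, the alphabetical assignment of integers to classes and the requirement that no class keeps its own label when shuffling are assumptions; the text specifies only the integer ranges and that surface forms are shuffled.

```python
import random
from typing import Dict, List

UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]


def proxy_labels(classes: List[str], start: int = 11) -> Dict[str, str]:
    """Map each class to a consecutive integer token, beginning at `start`."""
    return {cls: str(start + i) for i, cls in enumerate(classes)}


def shuffled_labels(classes: List[str], rng: random.Random) -> Dict[str, str]:
    """Map each class to another class's surface form.

    Assumption: reshuffle until no class keeps its original label.
    """
    if len(classes) < 2:
        raise ValueError("need at least two classes to shuffle")
    forms = list(classes)
    while any(cls == form for cls, form in zip(classes, forms)):
        rng.shuffle(forms)
    return dict(zip(classes, forms))


proxy = proxy_labels(UPOS)
print(proxy["ADJ"], proxy["X"])
print(shuffled_labels(UPOS, random.Random(0)))
```

Applying either mapping to the demonstrations and to the candidate label list leaves the word-to-class correspondence intact while changing only what the model sees as the label.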
Label shuffling confuses GPT-NeoX Shuffling the label forms greatly hurts overall model performance, with POS scores decreasing overall by 50.5%, and NER by 65.9%. Some classes are more robust to the shuffled labels than others: the AUX and DET parts-of-speech score within the standard error of the original class performance, whereas ADJ accuracy drops by 96.2% to near zero.
Interestingly, we find that the majority of all mistakes made in the shuffled setting (61.4%) result from the model predicting the true label of the class rather than the shuffled one shown in the demonstration. This happens more frequently for classes whose performance degrades severely when shuffled: 93.9% of errors on the NOUN class occur due to this phenomenon, and across classes, there is a strong correlation between performance degradation and the percent of errors predicting the true label (ρ = 0.69, p < 0.05). This behavior suggests that the model ignores in-context demonstrations of label mappings when the label is closely associated with a specific class by the model, similar to findings in Min et al. (2022).
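The statistic reported above (the share of shuffled-setting errors where the model predicts the class's true label) can be computed as in the following sketch. The function name and data layout are hypothetical.

```python
from typing import Dict, List


def true_label_error_rate(
    gold_classes: List[str],
    predictions: List[str],
    shuffled: Dict[str, str],
) -> float:
    """Fraction of errors in which the model predicts the original label.

    `shuffled` maps each class to the surface form shown in the demonstrations.
    """
    errors = 0
    reverted = 0
    for cls, pred in zip(gold_classes, predictions):
        if pred == shuffled[cls]:
            continue  # correct under the shuffled mapping
        errors += 1
        if pred == cls:
            reverted += 1  # the model fell back to the class's true label
    return reverted / errors if errors else 0.0


shuffled = {"NOUN": "VERB", "VERB": "DET", "DET": "NOUN"}
gold = ["NOUN", "VERB", "DET", "NOUN"]
preds = ["NOUN", "DET", "VERB", "VERB"]
print(true_label_error_rate(gold, preds, shuffled))
```

Restricting `gold_classes` to a single class gives the per-class version of the same statistic.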
GPT-NeoX in-context learns with arbitrary proxy labels Model behavior with the proxy labels is closer to the original labels, with performance decreasing by 25.8% on POS and 30.5% on NER. Indeed, on many labels that significantly degrade with label shuffling, the model performs significantly better on the proxy labels (NOUN and CCONJ in POS tagging, PER in NER). These results demonstrate that the model is able to perform in-context learning to extract linguistic structure, even when the tags are uninformative.

Sources of Linguistic Knowledge in Pretraining Corpus
The results in Section 5 demonstrate that the choice of label can have a large effect on structured prompting performance and imply that the model has prior task knowledge. We analyze the contexts in which the labels for POS tagging and NER appear in the Pile (Gao et al., 2020) to better understand what GPT-NeoX learns about the labels during pretraining. The analysis shows that task information is found in the pretraining data, both as labeled examples (Section 6.1) and in other related contexts (Section 6.2). However, no evidence of test data leakage was found. Given these findings, we evaluate the model in a new setting that substitutes an English description of each class (e.g., "adjective", "person") for the label in order to control for label leakage while still providing meaningful labels (Section 6.3).

Task Data Contamination
A likely location of labels is in labeled examples of the task that occur in the pretraining data sources, such as GitHub or web-crawled text. To test this, we search the Pile for instances of labeled POS and NER data. Table 2 shows the overall number of occurrences (Freq.), estimates of task data prevalence, and example contexts for a subset of labels; the complete results are given in Appendix A.4.

POS Tagging
Since the POS data is obtained from UD treebanks, we search the Pile for each label as it would appear in the treebank (with tab whitespace on either side of it, see CCONJ example context). We find a significant amount of UD data formatted in this manner: up to 33,000 occurrences for an individual label (NOUN). This is unsurprising given that GitHub, where UD treebanks are hosted, is a source of data for the Pile. However, we find no evidence of test data leakage across any of the POS label occurrences when compared to the GUM treebank (Zeldes, 2017). 6 We also perform a closer analysis of the CCONJ label: we compare each occurrence against all nine English treebanks in UD as well as manually examine it. We find that many CCONJ occurrences can be found in the English Web Treebank (EWT; Silveira et al., 2014) (1052/118/155 from the train/dev/test splits); others match with Parallel Universal Dependencies (PUD; Zeman et al., 2017) (10 occurrences from test set) and ParaTUT (Sanguinetti and Bosco, 2014) (1 occurrence from development set).
Our manual analysis finds that the majority of the CCONJ occurrences are in non-English documents (77%) 7 ; other languages whose treebanks we see include Finnish, German, and Arabic among many others. We also observe that every tab-separated instance of CCONJ occurs in the UD treebank format, indicating this automatic filter is a reasonable estimate of UD data leakage across labels.
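The automatic filter for UD-formatted data (a label with a tab on either side) can be approximated with a regular expression. This is a sketch over an in-memory string; how the authors actually scanned the Pile is not described, so the function and its interface are assumptions.

```python
import re
from collections import Counter
from typing import Iterable


def count_tab_delimited_labels(text: str, labels: Iterable[str]) -> Counter:
    """Count occurrences of each label with a tab immediately before and after."""
    counts: Counter = Counter()
    for label in labels:
        # Lookarounds leave the tabs unconsumed, so adjacent columns that
        # share a tab are both counted.
        pattern = re.compile(r"(?<=\t)" + re.escape(label) + r"(?=\t)")
        counts[label] = len(pattern.findall(text))
    return counts


conllu_line = "1\tand\tand\tCCONJ\tCC\t_\t2\tcc\t_\t_\n"
prose_line = "The CCONJ tag marks coordinating conjunctions.\n"
print(count_tab_delimited_labels(conllu_line + prose_line, ["CCONJ", "NOUN"]))
```

The prose mention is ignored because it is delimited by spaces rather than tabs, which is what makes this filter a reasonable proxy for treebank-formatted data.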
6 We also compare the test set against the Pile via other methods (exact document match and searching for individual lines); none of these match any test data against the Pile.
7 This is unsurprising: though the Pile is characterized as an "English text corpus" (Gao et al., 2020), prior work has found similar corpora derived from the web contain significant amounts of non-English text (Blevins and Zettlemoyer, 2022).
NER We find task data leakage for NER to be much more limited than POS: the most frequent label occurs 5,655 times in the Pile (other than "O" which occurs very frequently in many contexts). Since the CONLL format separates the tags with spaces instead of tabs, it is more difficult to filter for data leakage. Instead, we manually evaluate 100 examples for the BIO labels and give the proportion of the sample that is topical for NER.
Only a subset of relevant occurrences includes labeled data: our analysis found that labeled data is not common, and most cases are single example sentences annotated in a variety of ways that don't necessarily follow the CONLL format (I-MISC example context). Similar to POS tagging, we also find labeled examples in non-English languages; notably, some of the examples observed are incorrectly labeled. 8 This highlights that, while the model sees task data during pretraining, the quality and accuracy of that data are unverified.

Labels in Other Contexts
We also observe tags from our tasks in settings other than labeled data during the data analysis. Specifically, the most common relevant contexts we find them in are task documentation or explanations (see NOUN, DET, and B-ORG example contexts) and code related to the task (I-LOC example context). These contexts are interesting, as they provide information that may help the model learn by explaining the task in natural language or code, rather than by showing input/output pairs.
Of course, we also find instances of labels that are unrelated to the task at all. This is more common for the POS tags, whereas for NER labels up to 80% of the sampled contexts are related to the task. The topic of these unrelated contexts varies widely across labels, from biomedical and legal texts (see B-PER example context) to unrelated source code and news articles.
Table 3: Performance deltas (∆, column − row) and Spearman correlations (ρ) of classes between label sets. ∆ diagonals report performance with that set. †: delta is within standard error; *: p << 0.001.

Relationship Between Labels and Classes
Due to the quantity of task data uncovered in the Pile, we would like to control for the effect of pretraining on labeled data. To this end, we evaluate GPT-NeoX on semantically meaningful labels that were not seen in labeled contexts; specifically, we replace the task labels with the English name for each class (e.g., adjective, B-location), referred to as the words label set. The model achieves an accuracy of 78.11 ± 1.46 on POS tagging and an F1 score of 56.88 ± 0.86 for NER in this setting.
In Table 3, we compare the performance between these label sets and evaluate how correlated the performance of individual classes is across these sets. We observe an identical ranking across label sets in POS tagging and NER. On NER, the difference in model performance between the true labels and words as labels is within standard error. However, on POS there is a small but significant decrease of 5.4 points between the two; this drop in performance likely quantifies the benefit of observing the POS task data in the Pile.
The correlation study shows that performance across classes on the original, proxy, and words label sets for POS tagging are all strongly correlated (ρ > 0.9). However, their correlations with the shuffled labels are less significant; this difference is likely due to the prior task knowledge GPT-NeoX has for UD labels leading to predicting the actual label of the class rather than the shuffled one, as seen in Section 5.2.

Related Work
Prompting PLMs for Sequence Information Recent work has applied various prompting approaches to sequence tagging tasks, with a particular focus on NER (Cui et al., 2021; Ma et al., 2022). However, these approaches also require further training, most often by learning new prompt embeddings for the task (e.g., Liu et al., 2022b). Other work has finetuned language models to apply them to sequence tagging tasks (Liu et al., 2022a). In contrast, our approach requires no additional parameters to be learned.
Additionally, similar approaches to prompting have been proposed for other tasks; these methods decompose a target task and repeatedly prompt the model on subtasks, building on the model's outputs to generate the final prediction (e.g., Press et al., 2022). However, these approaches solve a different class of NLP tasks and use the outputs from the intermediate prompting steps differently (i.e., by conditioning on them in future prompting steps, whereas in structured prompting each output is a predicted label).

Probing Pretrained Models There is a rich literature on probing NLP models for their underlying knowledge (Belinkov et al., 2017; Blevins et al., 2018; Gulordava et al., 2018, inter alia). The approach has become particularly popular for analyzing masked PLMs (e.g., Liu et al., 2019, 2021b), with behavioral probes (e.g., Petroni et al., 2019; Balasubramanian et al., 2020) in particular using the LM setup to elicit knowledge from the model. However, prompting autoregressive PLMs (Brown et al., 2020; Schick and Schütze, 2021), though technically similar to behavioral probing, is usually not framed as probing the underlying model for knowledge. Some exceptions are Alivanistos et al. (2022), which uses prompting techniques to probe the LM for knowledge base relations, and work that replaces diagnostic probes with trained prompt embeddings for model analysis. Our work extends this framing by applying structured prompting as a behavioral probe for linguistic structure.
Analysis of Prompting Methods A major focus of this work is on analyzing what aspects of our setup make structured prompting work; the findings of this analysis are consistent with prior work. Specifically, our observations of the model's prior label knowledge are similar to those in Min et al. (2022), where they find the model does not need the correct label correspondences in the demonstration to perform well. We expand on these findings by showing that the model can still perform in-context learning with proxy labels where the model has no prior mapping for the task.
Other work has also documented the presence of task data in common pretraining corpora (Dodge et al., 2021), shown the effect of pretraining term frequencies on in-context performance (Razeghi et al., 2022), and demonstrated the ability of LMs to learn from task data during pretraining (Magar and Schwartz, 2022). Similarly, we document the presence of task data and labels in the Pile and find that this signal can help task performance due to the model's prior over the labels.

Conclusion
We propose structured prompting, a general paradigm for sequence tagging by prompting autoregressive PLMs. Our experiments show that PLMs perform well in this paradigm on few-shot sequence tagging for three tasks. Further analysis of structured prompting shows that (1) the approach can elicit linguistic structure from the model in many settings, including when the labels are unrelated to the task, and (2) while labeled task data is present in the pretraining corpora, using informative labels not derived from the task gives similar performance to using the task labels. These findings indicate that the understanding of these tasks is more general than memorization of the task data. More generally, our approach provides a method to probe PLMs for linguistic structure without training any new or existing parameters.

Limitations
Data Leakage As discussed in Section 6.1, we find evidence of labeled task data for POS tagging and (to a more limited extent) NER in the Pile. We attempt to control for this leakage by also evaluating with class names as labels, rather than the original tag set; however, due to the cost of training recent PLMs and their large pretraining corpora, it is impossible to completely control for data leakage when prompting existing models.
Both Brown et al. (2020) and Chowdhery et al. (2022) discuss the presence of task data in their pretraining corpora when training PLMs and the difficulty of controlling for it in their evaluations. For downstream users, this issue is further compounded in cases where the pretraining data is not available, as it is impossible to even check for contamination in those cases (such as our GPT-3 experiments).
Experimental Limitations with GPT-3 We only perform a subset of our evaluations of structured prompting on GPT-3, due to the cost of running the models in the API; this also means we do not run comprehensive prompt ablations to better tailor the setup for these models. Additionally, the results (i.e., lower performance than comparable GPT-Neo models) are difficult to interpret due to the black box nature of the GPT-3 models: it may be due to pretraining data differences (as mentioned in the previous limitation), the lack of prompt engineering for the models, or some other discrepancy.

English-only Experiments The experiments in this paper focus on English sequence tagging tasks. It is unclear how well the proposed method will generalize to other languages. We find evidence of task-relevant data in pretraining corpora in non-English languages, which suggests there is signal for the approach to work in other languages.

In this section, we test additional factors that may affect the performance of our proposed method.

A.1 Choice of evaluation set
For computational reasons, the models are evaluated on a fixed subset of 1000 randomly sampled test examples for each task. As using a smaller evaluation set can introduce noise into our performance estimates, we run a similar experiment on a number of the smaller models but resample the evaluation examples across five runs in addition to varying the demonstrations (Table 4). We find that varying the evaluation examples has a minimal effect on both the average performance and standard error on both POS tagging and NER.

A.2 Ablating BIO Constraints
During this work, we found that limiting the potential output tag space from the model with global BIO constraints made little difference in model performance for both NER and chunking (Table 5). Specifically, in every case, the difference between the two settings was within the standard error of the means across runs, with NER performing slightly better with the constraints and chunking performing slightly worse.

A.3 Full Results of Error Analysis
We provide additional error analysis results from Section 4.2 in Figure 6.

A.4 Full Results of Data Analysis
The complete data analysis for labels not shown in Section 6 is detailed in Table 6.

B Complete Results of Structured Prompting Experiments
We provide the full numerical results for the experiments in Section 4.1 in Table 7.