Retrieval-Augmented Parsing for Complex Graphs by Exploiting Structure and Uncertainty



Introduction
Large language models (LLMs) have demonstrated remarkable capabilities as effective few-shot learners (Brown et al., 2020). Recently, a new learning paradigm called in-context learning has been developed. Under this paradigm, the model is given a prompt including the test input and a few related exemplars, and generates outputs directly without updating its parameters. A typical approach to obtaining these exemplars is to retrieve training examples similar to the test input (Pasupat et al., 2021; Gupta et al., 2022). However, for realistic parsing tasks with large output graphs and non-trivial structure (e.g., dialogue-oriented semantic parsing), input similarity alone may not be effective in identifying the most informative exemplars for aiding graph prediction. As an example mentioned in Qiu et al. (2022), the test input "Schedule a meeting with my manager" is more similar to the example "Schedule a meeting with Alex" than to "Who is my manager", yet the latter contains an important action for searching an org chart that is also required by the test input.

† Co-senior authors. ‡ Work done at Google.
In this paper, we explore effective approaches to improve the generalization performance of retrieval-augmented LLMs for parsing complex graphs. Specifically, we consider exploiting two sources of information uniquely available to this problem: (1) structural similarity between output subgraphs, and (2) model uncertainty in graph component prediction. Our approach is motivated by two empirical investigations of LLM graph parsing (presented in Section 3): (a) Inadequacy of sequence-similarity retrieval (Section 3.1). When output graphs exhibit non-trivial structure, exemplars retrieved based on sequence similarity are less effective than those based on graph similarity, even when the similarity is computed with respect to the gold output graphs.1 (b) LLM uncertainty correlates with performance (Section 3.2). We conduct an exploratory study of the quality of LLM uncertainty as an indicator of its generalization performance in graph prediction, and identify a monotonic relationship between model uncertainty and accuracy for node and edge prediction (Figure 2). This implies that model uncertainty can serve as an effective signal for identifying the subgraphs that the model is struggling to predict, thereby helping the retrieval algorithm efficiently identify the most effective exemplars for aiding model prediction, especially when the output graph is large.
Based on the above observations, we propose

Figure 1: An overview of Structure-aware and Uncertainty-Guided Adaptive Retrieval (SUGAR). At each iteration i, the model prediction ŷi goes through a retrieval process consisting of three steps (shown in red): (1) graph uncertainty quantification, which measures the model's uncertainty in predicting the structured output yi at the substructure level (Section 4.1.1); (2) uncertain subgraph construction, which collects uncertain subgraphs ŷi,k based on the uncertainty scores of yi (Section 4.1.2); (3) structure-aware retrieval, which retrieves structurally similar exemplars targeting ŷi,k (Section 4.1.3). Note that at iteration 0, the exemplars are retrieved by input similarity, since there is no model prediction yet.
Structure-aware and Uncertainty-Guided Adaptive Retrieval (SUGAR), a retrieval-augmented LLM inference framework for complex graph parsing that incorporates both structural similarity and model uncertainty into the retrieval procedure (Section 4). Operating in an iterative manner, SUGAR first identifies uncertain regions in the model's graph prediction from the previous iteration, and adaptively retrieves exemplars based on their graph similarity with the identified uncertain subgraphs (Figure 1, Section 4). In this way, SUGAR can better target model weaknesses in structural prediction by retrieving the most informative exemplars, which is especially valuable in the setting of large and complex output graphs under a limited retrieval budget.
On a suite of real-world complex graph parsing benchmarks (i.e., SMCalFlow and Ecommerce), SUGAR exhibits distinct strength over its classic counterparts that leverage neither uncertainty nor structure, bringing clear performance improvements over the base model across iterations even when other retrieval-augmentation methods become counterproductive (Section 5). Further in-depth analysis reveals the interesting role of exemplar quality in model uncertainty, demonstrates the effectiveness of uncertainty as an early-stopping signal for retrieval iterations, and verifies the effectiveness of structural retrieval in improving model confidence (Section 6).2

Related Work
Retrieval-Augmented Parsing. Sequence-to-sequence (seq2seq) models have achieved state-of-the-art performance on many natural language processing tasks including complex parsing, e.g., dialogue-oriented semantic parsing and meaning representation parsing (Vinyals et al., 2015; Xu et al., 2020; Cui et al., 2022; Lin et al., 2022b,a). The general approach is to treat the output structure as a sequence and fine-tune a seq2seq model to learn the mapping between input sentences and output structures. To reduce the reliance on large-scale annotated data, several works augment the input with exemplars retrieved from the training data, using different strategies to select them.
For unsupervised retrieval, Pasupat et al. (2021) and Gupta et al. (2022) retrieve exemplars with similar input encodings from a pre-trained neural encoder. Zemlyanskiy et al. (2022) retrieve exemplars whose input and output (from a preliminary prediction) have high TF-IDF similarity with the sample input. The above work mainly focuses on fine-tuning settings. For supervised retrieval, Rubin et al. (2022) suggest using language models themselves to label examples that can serve as good prompts, and train a prompt retriever from this signal. In this work, we focus on unsupervised retrievers that do not rely on additional training data beyond the candidate pool they retrieve from.
Iterative Retrieval. While LLMs can generate coherent outputs via one-time retrieval-based augmentation, they often fall short on more complex tasks. To address this, there have been various attempts to retrieve exemplars more than once. Most of this work focuses on addressing long-form outputs, such as long-form question answering tasks (Fan et al., 2019; Stelmakh et al., 2022). The basic idea is to decompose a complex question into several easier sub-questions, and iteratively retrieve relevant information from knowledge agents for each sub-question (Press et al., 2022; Yao et al., 2022; Khot et al., 2022). Based on this line of work, FLARE (Jiang et al., 2023) further proposes to actively retrieve when the sub-answer contains low-confidence tokens. However, iterative retrieval for parsing complex structured outputs is less explored. The core challenge lies in the non-sequential nature of output structures such as trees or graphs, which cannot simply be decomposed sequentially. In this work, we aim to address complex parsing tasks and to progressively improve the model's prediction by iteratively retrieving relevant exemplars for the model's uncertain substructures. As we will show in Section 3, this cannot be achieved without incorporating structure and uncertainty.

Motivations
In this section, we present two empirical studies that motivate our method.

Structural Similarity Matters
The first question is what to retrieve. Here we study how different similarity functions perform on different semantic parsing tasks under the in-context learning setting, using GPT-3.5 with 10 exemplars in the prompt.
Specifically, we test input sentence similarity using the Universal Sentence Encoder (USE) (Cer et al., 2018) and BM25 (Schutze et al., 2008), and output similarity using BM25 and SMATCH (Cai and Knight, 2013). Note that SMATCH is the only metric that considers structural similarity beyond simple token overlap in the output (more details in Appendix A). We choose three semantic parsing tasks with output structures ranging from simple to complex: (1) MTOP (Li et al., 2021), a user intent slot-filling dataset that can be simplified to a sequence labeling task; (2) SMCalFlow (Andreas et al., 2020), a dataset of semantically detailed annotations of task-oriented natural dialogues, which can be treated as a tree parsing task; and (3) Redwoods-Ecommerce (Ecommerce for short) (Oepen et al., 2002), a dataset of annotated meaning representations (outputs are directed acyclic graphs) based on the English Resource Grammar (Flickinger et al., 2014), which is a DAG parsing task.
The evaluation results are shown in Table 1. We observe gaps between standard retrieval (based on input similarity) and oracle retrieval (based on output similarity), a finding that aligns with Qiu et al. (2022). Furthermore, as the output structure gets more complex (from MTOP to Ecommerce), it becomes more important to have exemplars that are similar in output structure, rather than merely similar in input or in sequence-level output. This is because sequence-level similarity metrics only consider token overlap and ignore syntactic or semantic relationships in the output structure. As outputs become more complex, these metrics are less likely to find structurally similar exemplars. Therefore, a structure-aware retriever is important for complex parsing tasks.

Model Uncertainty Matters
However, retrieving exemplars based on structural similarity raises several challenges. First, at the initial stage we do not have access to gold output structures. This can be addressed by obtaining a preliminary prediction using retrieval with similar inputs, as proposed in Zemlyanskiy et al. (2022). Second, retrieving similar outputs based on the entire structure can be ineffective and may introduce unwanted noise. Given these challenges, we investigate whether it is possible to measure model uncertainty at the substructure level; if so, which part should we retrieve for? Our hypothesis is that we should retrieve only when the model is uncertain about some substructure predictions, which are very likely to be flawed.
To validate our hypothesis, our second study analyzes the LLM's behavior in predicting structure components in terms of model uncertainty. Specifically, we explore the correlation between model probability and performance for the in-context learning model (more details can be found in Appendix B). As shown in Figure 2, high model probability generally corresponds to high performance and vice versa. Our study confirms that model uncertainty is effective for detecting prediction errors. This means that we can retrieve structurally similar exemplars targeting these uncertain substructures, which can help address the flawed parts of the prediction.

SUGAR: Structure-aware and Uncertainty-Guided Adaptive Retrieval
This section describes the details of SUGAR for parsing complex structures. Typically, the output structure is a semantic graph that is rooted, directed, acyclic, and labeled (Opitz, 2023).
Problem Formulation. We aim to solve a graph parsing problem that maps a natural language utterance x to a target graph representation G = ⟨N, E⟩, where N is the set of nodes and E ⊆ N × N is the set of edges. For seq2seq models, G is represented as a linearized graph string y. Following Lin et al. (2023), we adopt PENMAN annotation (Kasper, 1989) to linearize all graph structures in this work, which is a serialization format for the directed, rooted graphs used to encode semantic dependencies (details on graph linearization can be found in Appendix C).
Figure 1 shows an overview of SUGAR, and Algorithm 1 (Retrieval-augmented Inference with SUGAR) summarizes the detailed process. In Section 4.1, we illustrate the retrieval process based on preliminary predictions ŷi at step i (the retrieval process in Figure 1), which includes three steps: (1) graph uncertainty quantification for ŷi (Section 4.1.1); (2) uncertain subgraph construction, i.e., ŷi,k (Section 4.1.2); (3) structure-aware retrieval for ŷi,k (Section 4.1.3). The retrieval process operates in an iterative manner, progressively improving the model's prediction by retrieving exemplars based on the model's predictive uncertainty from the previous iteration (Section 4.2).

Graph Uncertainty Quantification
Recent years have witnessed the success of applying seq2seq models to graph parsing tasks, where the outputs are compositionally structured (e.g., a graph or a tree). However, these seq2seq approaches pose a technical challenge in properly quantifying model uncertainty for graph prediction. This is because the autoregressive seq2seq probability is not well-suited for describing model uncertainty in predicting elements or substructures of the output graph, for which the probabilistic graphical model (PGM) is a more suitable formalism. To address this issue, we leverage the Graph Autoregressive Process (GAP) proposed by Lin et al. (2023), which establishes a correspondence between seq2seq output probabilities and PGM probabilities, i.e., assigning a model probability to each node or edge of the graph.
Specifically, given an input sequence x and an output sequence y = y1 y2 · · · yN that refers to a graph G = ⟨N, E⟩, GAP maps the token-level autoregressive distribution p(y|x) = ∏n p(yn | y<n, x) to a graph-level factorized distribution p(G|x) = ∏v∈G p(v | pa(v), x), where p(v| pa(v), x) is the conditional probability of graph element v with respect to its parents pa(v) in G.
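The exact construction of GAP is given in Lin et al. (2023). As a minimal sketch of the underlying idea, suppose we are given a hypothetical alignment in which each graph element corresponds to a known contiguous span of output tokens (which a variable-free linearization makes possible); element probabilities can then be aggregated from token log-probabilities:

```python
import math

def element_probabilities(token_logprobs, element_spans):
    """Map token-level autoregressive log-probabilities to graph-element
    probabilities. `element_spans` is an assumed precomputed alignment
    mapping each element (node or edge label) to a contiguous token span;
    the actual GAP construction derives it from the linearization.

    token_logprobs: list of log p(y_n | y_<n, x), one per output token.
    element_spans:  dict mapping element id -> (start, end) token indices.
    """
    probs = {}
    for elem, (start, end) in element_spans.items():
        # The probability of an element is the product of the
        # probabilities of the tokens that realize it.
        probs[elem] = math.exp(sum(token_logprobs[start:end]))
    return probs
```

Names and data layout here are illustrative, not the paper's implementation.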

Uncertain Subgraph Construction
To leverage the uncertainty in the model prediction p(G|x) for efficient retrieval, we consider the concept of an uncertain subgraph, which is a subgraph that contains:
• Uncertain element. We consider a graph element v ∈ G to be uncertain if its probability p(v| pa(v), x) is below a certain threshold ϵ.
• Relatively-confident neighbors. Given an uncertain element v and a subgraph s_d(v) of maximum depth d that surrounds v, we define the relatively-confident neighbors of v as the subset c_d(v) ⊂ s_d(v) whose probabilities are above the threshold ϵ.
As shown, by coupling the uncertain element v with its relatively-confident graph neighbors c_d(v), the uncertain subgraph ŷv = {v} ∪ c_d(v) provides the retrieval algorithm with fine-grained and contextualized information about model uncertainty in the prediction of graph elements (see Figure 3 for an example). In practice, to limit the size of uncertain subgraphs and keep the cost of structural similarity computation within a feasible range, we set a parameter d to control the maximum depth of the uncertain subgraphs ŷv.
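A minimal sketch of this construction, treating nodes and edge labels uniformly as graph elements and assuming an adjacency map over them (names and data layout are illustrative, not the paper's implementation):

```python
from collections import deque

def uncertain_subgraphs(adjacency, probs, eps=0.8, d=2):
    """Collect uncertain subgraphs: each uncertain element v (prob < eps)
    together with its relatively-confident neighbors (prob >= eps) within
    depth d. `adjacency` maps an element to its neighboring elements;
    `probs` maps an element to its model probability.
    """
    subgraphs = []
    for v, p in probs.items():
        if p >= eps:
            continue  # v is confident; no retrieval needed for it
        # BFS out to depth d, keeping only the confident neighbors.
        confident, frontier, seen = set(), deque([(v, 0)]), {v}
        while frontier:
            u, depth = frontier.popleft()
            if depth == d:
                continue
            for w in adjacency.get(u, []):
                if w in seen:
                    continue
                seen.add(w)
                if probs.get(w, 0.0) >= eps:
                    confident.add(w)
                frontier.append((w, depth + 1))
        subgraphs.append({v} | confident)
    return subgraphs
```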

Structure-aware Retrieval
To identify informative graph exemplars that best address the model's uncertainty in predicting graph elements, we leverage the uncertain subgraphs introduced above and consider a retrieval policy that uses the ŷv's as queries (with the uncertain element v masked out) to retrieve structurally similar exemplars.
Specifically, we consider the typical setting where there is a retrieval candidate pool of pairs {(x_c, y_c)}, c = 1, . . . , C, of input sentences x_c and output graphs y_c. To perform structure-aware retrieval, we first prepare a subgraph retrieval pool P_g = ∪_c S_d(y_c), where S_d(y_c) = {y_c,j}_j is the set of all depth-d subgraphs of y_c. Then, at inference time, given each uncertain subgraph ŷv, we retrieve a subset {y_c′,j′} ⊂ P_g based on graph similarity with ŷv, which eventually yields the set of exemplars {(x_c′, y_c′)} used for retrieval-augmented inference.
Practical Implementation. In this work, we use the SMATCH metric (Cai and Knight, 2013) for computing graph similarity. Given a query graph with N_q nodes and k candidate graphs with N_c nodes each, the time complexity of the graph matching algorithm is O(k · N_q · N_c). In practice, the size of N_q is controlled by d, and k can be significantly reduced by first pre-filtering the candidate pool using a fast heuristic metric (e.g., atom similarity).3
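The two-stage procedure can be sketched as follows, where `graph_similarity` stands in for the full SMATCH computation and the atom-overlap pre-filter plays the role of the fast heuristic mentioned above (all names are illustrative):

```python
def retrieve_exemplars(query_subgraph, pool, graph_similarity,
                       top_k=1, prefilter=50):
    """Two-stage structure-aware retrieval sketch.

    query_subgraph: set of atoms (node/edge labels) of the uncertain
                    subgraph, with the uncertain element masked out.
    pool: list of (exemplar_id, subgraph_atoms) pairs, one entry per
          depth-d subgraph extracted from each candidate output graph.
    graph_similarity: expensive structural metric (e.g., SMATCH), assumed
                      to be given.
    """
    # Stage 1: cheap pre-filter by atom overlap (Jaccard), reducing the
    # number of expensive O(Nq * Nc) matching calls.
    def atom_sim(atoms):
        union = query_subgraph | atoms
        return len(query_subgraph & atoms) / len(union) if union else 0.0

    shortlist = sorted(pool, key=lambda e: atom_sim(e[1]),
                       reverse=True)[:prefilter]

    # Stage 2: rank the shortlist by the full structural similarity.
    ranked = sorted(shortlist,
                    key=lambda e: graph_similarity(query_subgraph, e[1]),
                    reverse=True)
    return [eid for eid, _ in ranked[:top_k]]
```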

Improving Parsing Performance with Uncertainty-aware Iterative Retrieval

Due to its uncertainty-aware nature, SUGAR, introduced in Section 4.1, can be applied iteratively to the model prediction, continuously retrieving new exemplars to address model uncertainty from the previous iteration until the model reaches a satisfactory level of confidence. This is analogous to iterative refinement approaches in the recent literature, where a model's initial generation can be improved by additional self-correction steps (Reid and Neubig, 2022; Schick et al., 2023; Welleck et al., 2023; Jiang et al., 2023).
In the experiments (Section 6.2), we study model performance under different retrieval and refinement strategies, validating that incorporating structure and uncertainty information are both important for improving parsing performance under iterative refinement.

Datasets & Model Settings
Datasets. In this paper, we use two complex semantic parsing datasets, covering dialogue-oriented semantic parsing and graph-based grammar parsing.
• SMCalFlow. SMCalFlow (Andreas et al., 2020) is a large corpus of semantically detailed annotations of task-oriented natural dialogues. The annotation uses dataflow computational graphs, composed of a rich set of both general and application-specific functions, to represent user requests as rich compositional expressions.
• Redwoods-Ecommerce (Ecommerce). The LinGO Redwoods Treebank is a collection of hand-annotated corpora for an English grammar consisting of more than 20 datasets (Oepen et al., 2002). We choose the Ecommerce subset of Redwoods, which consists of email messages composed by online customers. The graph output of this subset is sufficiently complex, and is considered out-of-distribution compared to the standard Wall Street Journal-based training split of Redwoods (Lin et al., 2022b).
Models & Base Retrievers. We choose GPT-3.5 (text-davinci-003; Ouyang et al., 2022) as our large language model for the in-context learning setting. To initialize the SUGAR prediction, we consider three choices of base retriever: (1) Random; (2) CASPER-USE (Pasupat et al., 2021), which is based on the Universal Sentence Encoder (USE) (Cer et al., 2018); and (3) BM25 (Robertson et al., 2009). Recall that these retrievers are only used to initialize the first round of SUGAR prediction, and are not used in subsequent iterations (Algorithm 1).
Baselines. Due to the iterative nature of SUGAR, its predictions are compared to those of the base retrievers above using the same total number of exemplars. For example, a base retriever using 8 exemplars is comparable to SUGAR at iteration 1 (5 base exemplars plus 3 iterative exemplars). However, due to the sequence-length limitation, it is infeasible to add a large number of exemplars in one prompt, which makes it impossible to obtain results for base retrievers using more than 10 exemplars. This also highlights an advantage of iterative retrieval: it circumvents the sequence-length limitation. Another benefit is that the model's intermediate results toward the final prediction become available, so additional exemplars can concentrate on the weak parts of these intermediate results. We also consider two iterative baseline variants based on output similarity: GandR_iter (Zemlyanskiy et al., 2022), which retrieves exemplars using BM25 over both the input sentence and the predicted graph (weight α = 0.5),4 and Oracle, which retrieves exemplars based on graph similarity with gold subgraphs. Further details about the settings of the model, candidate pool, and prompts can be found in Appendix D, and Appendix E reports additional supplementary studies on the much smaller T5 models (Raffel et al., 2020) under out-of-domain and low-resource settings.

Results
The evaluation results are shown in Figure 4. As shown, SUGAR progressively improves the model prediction across iterations, even in settings where its iterative counterparts become counterproductive (e.g., Base = BM25 on Ecommerce). Specifically, SUGAR significantly outperforms its base retrievers under the same retrieval budget (Base@8), improving the base retrievers CASPER-USE and BM25 by 26.61% and 20.36% absolute on SMCalFlow, and by 17.58% and 12.37% absolute on Ecommerce.
Figure 4: In-context learning results. Base@5 and Base@8 refer to retrieving 5 and 8 exemplars with the base retriever. Since iterative retrieval adds 3 additional exemplars per iteration, Base@8 is comparable to results at iteration 1.

In Appendix F, to understand whether the benefit of SUGAR is orthogonal to the choice of base model and cannot be surpassed by simply upgrading the model architecture, we conduct two sets of experiments: (1) a comparison between SUGAR and its baseline variants based on older GPT variants (e.g., text-davinci-002) that differ in training method; and (2) a comparison of baseline methods under GPT-4 versus SUGAR under GPT-3.5. The experimental results show that SUGAR successfully improves all LLMs in terms of exact match and graph similarity. Furthermore, SUGAR with GPT-3.5 also works better than GPT-4 using the base retriever. This concretely shows that our method provides non-trivial improvement beyond what can be achieved by simply upgrading the architecture.
Table 2 shows some exemplars (sentence inputs) from different retrievers. Compared to other retrievers, SUGAR accurately retrieves exemplars that closely align with the uncertain parts of the model prediction, which is crucial for addressing the uncertainties inherent in model inference and thereby refining the model's prediction.
Analysis: the Role of Uncertainty

Uncertainty Quality Matters
We notice that SUGAR does not work very well with the random base retriever, which might be due to its relatively poor uncertainty quality compared to the other two base retrievers. To validate this hypothesis, we further evaluate the uncertainty quality of the three base retrievers.
A common approach to evaluating a model's uncertainty quality is to measure its calibration performance, i.e., whether the model's predictive uncertainty is indicative of its predictive error, e.g., via the expected calibration error (ECE; Naeini et al., 2015). Based on ECE, Lin et al. (2023) propose the Compositional Expected Calibration Error (CECE), which measures the difference in expectation between the model's predictive confidence on graph elements and their actual match to the gold graph, and thus better reflects the model's behavior in predicting graph structures. Table 3 reports the CECE results (based on all graph elements, nodes, and edges).
We find that the calibration of the random base retriever is consistently worse than that of BM25 and CASPER-USE, indicating that the relatively-confident neighbors in the uncertain subgraphs for the random retriever might fail to capture accurate contexts for iterative retrieval. This also suggests that improving LLMs' calibration can be a fruitful avenue for improving retrieval-based augmentation.
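For reference, the standard ECE that CECE builds on can be sketched as follows; CECE replaces the binary correctness below with each graph element's match against the gold graph:

```python
def expected_calibration_error(probs, correct, n_bins=10):
    """Standard ECE over predicted elements: partition predictions into
    confidence bins and average the |accuracy - confidence| gap,
    weighted by bin size. This is the plain binary-correctness version,
    shown for illustration.
    """
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        # Map probability to its bin; p == 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, c))
    ece, total = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean confidence in bin
        acc = sum(c for _, c in b) / len(b)    # mean correctness in bin
        ece += (len(b) / total) * abs(acc - conf)
    return ece
```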

Uncertainty as an Early Stopping Signal
The parsing generation in Figure 1 that edits the previous model prediction is a zero-shot process, i.e., the LLM is not provided with examples of how to improve a prediction at each step. As a result, the model has a tendency to keep editing predictions from previous iterations, even when they are already accurate. At this stage, uncertainty becomes an important signal for stopping the iteration process.
Specifically, SUGAR terminates the iterative retrieval process and returns the last model prediction once a satisfactory level of confidence has been achieved. Additionally, we incorporate model uncertainty as a stopping signal for the GandR_iter baseline (Table 4), and consistently observe an improvement over the baseline that does not use uncertainty as a stopping signal. This further validates the crucial role of uncertainty in the iterative retrieval process.

Uncertainty across Retrieval Iterations
Two additional interesting questions are: (1) what type of graph requires more iterations? and (2) how does uncertainty evolve across retrieval iterations?
We first explore the correlation between the number of iterations and graph complexity, and report the results in Table 5 (base retriever: CASPER-USE). As expected, a graph of higher complexity typically demands a greater number of iterations before achieving a satisfactory confidence level.
We then visualize graphs at different iterations and explore the progression of uncertainty levels (see details in Appendix G). Generally, we notice that uncertain graph elements become less uncertain as iterations progress. However, we also observe occasional fluctuations in the uncertainty, which can trigger instability in neighboring contexts. This means that as uncertain elements become certain, the contexts around them may lose some degree of confidence. We reserve further exploration of this phenomenon for future work, which we believe is a promising direction for understanding LLMs' calibration in complex structure generation.

Conclusions
In this work, we present Structure-aware and Uncertainty-Guided Adaptive Retrieval (SUGAR), a new retrieval-augmented parsing framework using LLMs for complex graphs. This work deepens the current practice of retrieval-augmented models for complex structures by incorporating information about the model's uncertainty in graph component prediction and the structural similarity of output subgraphs. Experimental results on two complex seq2seq semantic parsing tasks, i.e., SMCalFlow and Ecommerce, demonstrate the practical effectiveness of the proposed approach in the modern setting of graph parsing with pretrained LLMs.
Our future work includes considering more advanced graph similarity metrics beyond SMATCH (e.g., incorporating additional similarity measures over node and edge properties), as well as larger-scale and more fine-grained evaluation on graph parsing benchmarks with distinct properties (e.g., long-tail generalization, graphs of specific families) to further elucidate in which settings graph similarity is most effective. Furthermore, it is also of interest to study the generalization of this approach to a broader class of modern program synthesis problems, e.g., code generation.

Acknowledgement
Our work is sponsored in part by the National Science Foundation Convergence Accelerator under award OIA-2040727, as well as by generous gifts from Google, Adobe, and Teradata. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and should not be interpreted as necessarily representing the views, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright annotation hereon. We thank Deepak Ramachandran for helpful discussion.

Limitations
This work focuses on advanced inference procedures to improve retrieval-augmented LLMs for complex graph parsing problems. The work is limited in three aspects due to the nature of the LLM setup: (1) Our evaluation has focused on the in-context learning setting, where the LLM is not fine-tuned on domain-specific data. Although this is a standard setting for modern LLMs, it is still of scientific interest to understand the interplay between parameter fine-tuning and the effectiveness of retrieval-augmentation procedures, which we leave for future work. (2) This work has focused on GPT-3.5, which was one of the strongest openly available LLMs at the time of writing. As the behavior of an LLM can be impacted by its pretraining procedure, it is also of interest to generalize this study to a wider class of LLMs. (3) Finally, the graph similarity metric considered in this work (i.e., SMATCH) has a computational complexity quadratic in graph size. The current work mitigates the issue by restricting attention to depth-d subgraphs, with the caveat of limiting SUGAR's ability to reason about similarity in the global graph structure. Identifying practical and more computationally efficient structural similarity metrics can therefore further improve the scalability of the SUGAR approach.

C Graph Linearization
We use PENMAN notation for graph linearization, originally called Sentence Plan Notation in the PENMAN project (Kasper, 1989). PENMAN is a serialization format for the directed, rooted graphs used to encode semantic dependencies, most notably in the Abstract Meaning Representation (AMR) framework (Banarescu et al., 2013). It looks similar to Lisp's S-expressions in using parentheses to indicate nested structures.
To make PENMAN notation compatible with seq2seq learning, we adopt a variable-free version of PENMAN, first proposed in Lin et al. (2022b). Table 6 shows variable-free PENMAN linearized examples for the two semantic parsing datasets adopted in our experiments.

D Detailed Experiment Settings
Parameter Settings. For the uncertainty threshold ϵ, considering that the model's predictions are relatively weak at the initial stage, we set a warm-up schedule for ϵ: specifically, ϵ1 = 0.5, ϵ2 = 0.8, ϵ3 = 0.9. We set the maximum subgraph depth d = 3. For each model prediction, the number of uncertain subgraphs is k = 3, and we retrieve 1 exemplar for each uncertain subgraph.
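The warm-up schedule can be expressed as a simple lookup (a sketch: the 0-based iteration indexing and the clamping behavior beyond the third iteration are our assumptions, not stated in the settings above):

```python
def epsilon_schedule(iteration, schedule=(0.5, 0.8, 0.9)):
    """Warm-up schedule for the uncertainty threshold epsilon: early
    iterations use a lenient threshold because initial predictions are
    weak, while later iterations demand higher confidence. Clamped to
    the last value once the schedule is exhausted (assumption).
    """
    return schedule[min(iteration, len(schedule) - 1)]
```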
Models. For the in-context learning setting, considering the impressive performance achieved by GPT-3.5 (Ouyang et al., 2022), we test our methods on text-davinci-003. For the fine-tuning setting, we choose T5 (Raffel et al., 2020) as our pretrained model, a pre-trained sequence-to-sequence Transformer that has been widely used in many NLP applications. We use the open-sourced T5X codebase, a new and improved implementation of T5 in JAX and Flax. Specifically, we use the official pretrained T5-Large (770 million parameters).

Prompt Design. Two prompts are adopted in the in-context setting with GPT-3.5. The first contains exemplars from the base retriever to generate the preliminary prediction at the initial stage (prompt 1, shown in Figure 6). The second contains exemplars from uncertainty-guided retrieval and incorporates the prediction from the previous step (prompt 2, shown in Figure 7).
The evaluation results are reported in Table 7. As can be seen, the other retrievers that do not consider structural similarity all fail on these two complex parsing tasks, i.e., they perform even worse than the base model without retrieval, while SUGAR consistently improves the base model on both datasets. Specifically, SUGAR achieves error reduction rates of 6.72% and 9.45% on SMCalFlow and Ecommerce respectively, and improves the exact match rate by 2.72% and 8.38% respectively.

G Sample Graph Prediction Visualizations
Some sample graph prediction visualizations on the Ecommerce dataset using CASPER-USE are shown in Table 8. We observe that as iterations progress, the confidence scores of uncertain graph elements generally increase and the number of uncertain graph elements generally decreases. However, we also observe occasional fluctuations in the uncertainty, which can trigger instability in neighboring contexts. For example, for the node pron from the first to the second iteration in the second example, the probability decreases from 0.9381 to 0.8746. We reserve this observation for further exploration in future work, which we believe is a promising direction for understanding LLMs' calibration in complex structure generation.

Figure 2: Correlations between model probabilities and performance for node/edge prediction.

Figure 3: An example of an uncertain subgraph (ϵ = 0.8, d = 2) in the prediction for the input sentence "You send me the wrong camcorder" (Ecommerce). The dotted square marks the uncertain subgraph, where the red-squared node and red-font edge refer to uncertain elements, and the remaining parts are their neighboring context.

Figure 5: Correlations between model probabilities and performance for node/edge prediction on SMCalFlow and Ecommerce using different retrieval strategies.

Figure 6: Prompt example containing exemplars from the base retriever.

Table 2: Exemplars from different retrievers on the SMCalFlow development set. Highlighted text refers to uncertain parts.

Table 3: Compositional ECE for different base retrievers on SMCalFlow and Ecommerce under the in-context learning setting.

Table 4: Results of the GandR_iter baseline in iterative retrieval with and without model uncertainty as the stopping signal. * means using uncertainty.

Table 6: Examples of variable-free PENMAN linearized graphs on Ecommerce and SMCalFlow (task details can be found in Section 5.1). Here :carg denotes corresponding spans in the sentence.

Table 7: Fine-tuning results. EM means exact match rate. Precision and recall are based on graph element triples.