ASPIRO: Any-shot Structured Parsing-error-Induced ReprOmpting for Consistent Data-to-Text Generation

We present ASPIRO, an approach for structured data verbalisation into short template sentences in zero- to few-shot settings. Unlike previous methods, our approach prompts large language models (LLMs) to directly produce entity-agnostic templates, rather than relying on LLMs to faithfully copy the given example entities or on validating/crafting the templates manually. We incorporate LLM re-prompting, triggered by algorithmic parsing checks, as well as PARENT-metric-induced consistency validation, to identify and rectify template generation problems in real time. Compared to direct LLM output, ASPIRO averages a 66% reduction in parsing error rate in generated verbalisations of RDF triples on the DART dataset. Our best 5-shot text-davinci-003 setup, scoring BLEU of 50.62, METEOR of 45.16, BLEURT of 0.82, NUBIA of 0.87, and PARENT of 0.8962 on the Rel2Text dataset, competes effectively with recent fine-tuned pre-trained language models.


Introduction
The data-to-text task (Reiter, 1996) aims to build a faithful natural language interpretation of structured data such as relational tables or Resource Description Framework (RDF) triples (Miller, 2001). However, without proper context, the given structured data may not sufficiently represent the relationships between entities, leading to ambiguity (Dušek et al., 2019). To battle this, some works rely on fine-tuning pre-trained language models (PLMs) on task-specific datasets in supervised or semi-supervised ways (Ke et al., 2021; Agarwal et al., 2021), but the domain of the resulting system is limited and requires well-labelled training data (Keymanesh et al., 2022). In contrast to fine-tuning, Kasner and Dusek (2022) prove that zero-shot neural systems are a possible solution, where in-domain data is introduced via simple human-crafted templates for each unique relation in the knowledge graph. Xiang et al. (2022) nullify the requirement for human labelling entirely by utilising GPT3-davinci (Brown et al., 2020), a large language model (LLM) with broad general knowledge, to disambiguate RDF triples into short sentences and automatically parse them into reusable sentence templates as an alternative to human-crafted templates. In this paper we introduce ASPIRO (code available at github.com/vejvarm/ASPIRO), a robust N-shot variant of the data disambiguation step presented by Xiang et al. (2022) and a promising alternative to fine-tuning PLMs for crafting RDF verbalisations (Kasner et al., 2023). At its core, ASPIRO uses simple rules to algorithmically flag errors in the templates (such as missing subject, multiple objects, etc.) and re-prompt the LLM until all errors are alleviated or the maximum number (N) of retries has been reached. We evaluate changes in automated metrics and the reduction of parsing errors in different configurations of ASPIRO on DART (Nan et al., 2021) and Rel2Text (Kasner et al., 2023), and compare the original RDF verbalisation prompt used by Xiang et al. (2022) with our prompt focused on enforcing structured JSON output with intermediate fields as guidelines.
Related Work

Delexicalization: Einolghozati et al. (2020) and Heidari et al. (2021) find that without delexicalization, generative models can produce incomplete representations of the entities and concepts in structured data verbalisations, leading to misinterpretation and failures in production. Our JSON-structured prompt ( §G.2) forces the LLM to directly produce named-entity-agnostic templates.
0-shot to N-shot: Our work is heavily inspired by and builds upon the disambiguation step from Xiang et al. (2022), which is equivalent to the 0-shot setting of our N-shot Generator. We also use their prompt ( §G.1) as a baseline against our JSON prompt ( §G.2).
Refining LLM outputs: Madaan et al. (2023) and Shinn et al. (2023) show that iterative prompting and chain-of-thought reasoning can significantly improve the outputs of LLMs. We lean on their findings in designing our ASPIRO pipeline. However, back-and-forth prompting of LLMs can be expensive, which we counterbalance by using our Rule-based parser ( §3.1) and the PARENT (Dhingra et al., 2019) F1 score ( §3.2) as cost-efficient gateways to decide whether additional prompting is necessary.

Methods
The proposed method (ASPIRO) revolves around the conversion of structured data samples into verbalisation templates using a two-stage pipeline: the N-shot Generator ( §3.1) and the Consistency Validator ( §3.2). The pipeline processes structured data samples, wherein each sample comprises one or more RDF triples which share the same relation. ASPIRO (see Figure 1) starts with an initial prompt to verbally articulate the structured data. This is equivalent to prompting a single LLM directly. If the zeroth attempt isn't accurate, it retries a maximum of N times, refining the previous completion based on parsing errors ( §3.1.2). Subsequently, the outputs are validated for consistency, ensuring faithful and reliable verbalisations. We explain the individual stages and their sub-modules in the sections below. Refer to Figure 1 for the full pipeline and terminology on general input. A step-by-step flow of the pipeline and an example on specific input are provided in §3.3 and Figure 2, respectively.

N -shot Generator
The N-shot Generator consists of an LLM stack and a Rule-based parser. The LLM stack is tasked with generating verbalisation attempts based on a given initial prompt ( §G.1). It does so with the help of the Rule-based parser, which checks the generated completions for structural accuracy, ensuring they adhere to expected patterns.
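The paper does not enumerate the parser's conditions here, but a minimal sketch of the kind of placeholder checks it describes (missing subject, multiple objects) could look as follows. The function name and error messages are illustrative, not the paper's implementation:

```python
import re

def parse_template(template: str) -> list[str]:
    """Flag structural problems in a candidate verbalisation template.

    Illustrative checks only: we verify that exactly one subject and
    exactly one object placeholder appear in the template.
    """
    errors = []
    n_subj = len(re.findall(r"<subject>|<s>", template))
    n_obj = len(re.findall(r"<object>|<o>", template))
    if n_subj == 0:
        errors.append("Output is missing <subject> placeholder.")
    elif n_subj > 1:
        errors.append("Output has multiple <subject> placeholders.")
    if n_obj == 0:
        errors.append("Output is missing <object> placeholder.")
    elif n_obj > 1:
        errors.append("Output has multiple <object> placeholders.")
    return errors
```

An empty error list means the template passes; any returned messages can be fed back to the LLM in the retry prompt.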

LLM Stack
The LLM stack is a sequence of N + 1 LLMs, indexed from 0 to N. L_0 is responsible for the initial completion, and each further retry shot, initiated by the Rule-based parser ( §3.1.2), increments the index by 1. Each L_n is instantiated separately and does not have to be the same model. Equation (1) shows the single completion for structured input sample x at shot n:

y_n = L_n(T(x))    (1)

where T is a given prompt and can be either T_I (initial) or T_R (retry).
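The retry behaviour of the stack can be sketched as below. This is a simplified reading of the mechanism, not the paper's code: `llms`, `parse`, and the prompt builders are placeholder interfaces:

```python
def n_shot_generate(x, llms, parse, initial_prompt, retry_prompt):
    """Run the LLM stack L_0..L_N until the parser accepts a completion.

    llms           -- list of N+1 callables, one per shot
    parse          -- returns a list of error strings (empty = valid)
    initial_prompt -- builds the initial prompt T_I from sample x
    retry_prompt   -- builds the retry prompt T_R from x, the previous
                      completion, and the flagged errors
    """
    y = llms[0](initial_prompt(x))               # shot 0
    for n in range(1, len(llms)):
        errors = parse(y)
        if not errors:
            break                                 # template parsed cleanly
        y = llms[n](retry_prompt(x, y, errors))   # retry shot n
    return y
```

Note that the loop stops early as soon as a completion passes the parser, so later retry shots cost nothing for already-valid templates.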

Consistency Validator
Even if the outputs from the N-shot Generator adhere to the structural patterns, they might still contain inaccuracies, such as hallucinated content. This module assesses the quality of the verbalisations using the PARENT statistical metric (Dhingra et al., 2019). If the PARENT F1 score is too low, the module utilises an LLM with a specialised Consistency prompt ( §G.4) to improve the sentence.

PARENT F1 threshold
To gauge the quality of the completion y_n from the N-shot Generator, we set a minimal threshold (µ) for the PARENT score of y_n. First, we construct the respective hypothesis, table, and reference entries, in which "<subject>" and "<object>" are replaced with "<entity>" to prevent penalising order discrepancies between hypothesis and table. We then calculate the PARENT F1 score of y_n using equation (3) against this artificially constructed table and reference.
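The entity-masking idea can be illustrated with the sketch below. The exact table and reference formats are assumptions for illustration (the paper's construction is given in its appendix), and the actual F1 would come from a PARENT implementation:

```python
def build_parent_inputs(template: str, relation: str):
    """Mask placeholders with a shared "<entity>" token (illustrative).

    Collapsing <subject>/<object> into one token means the PARENT score
    does not penalise a hypothesis that swaps subject and object order
    relative to the table.
    """
    hypothesis = (template.replace("<subject>", "<entity>")
                          .replace("<s>", "<entity>")
                          .replace("<object>", "<entity>")
                          .replace("<o>", "<entity>"))
    # Assumed single-row table and synthetic reference for the relation.
    table = [("<entity>", relation, "<entity>")]
    reference = f"<entity> {relation} <entity>"
    return hypothesis, table, reference

def needs_consistency_fix(parent_f1: float, mu: float = 0.7) -> bool:
    # Completions scoring below the threshold are passed to the
    # Consistency LLM (§3.2.2).
    return parent_f1 < mu
```

With µ = 0.7 (the value used in the experiments), only completions whose PARENT F1 falls below 0.7 trigger the more expensive consistency prompt.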

Consistency LLM
If the calculated PARENT score from §3.2.1 is not sufficient, we call another LLM with prompt T_C, as in eq. (4).
The prompt T_C is designed to guide L_C to identify problems with the given completion, provide advice on how to fix them, and subsequently produce a fixed completion in a structured JSON output. See §G.4 for the full version of the prompt.

Stepwise Pipeline Formulation
Given a dataset of structured data samples {x^r}_{r∈R}, where x^r = {x^r_1, x^r_2, ..., x^r_m} and each x^r_j = ⟨s^r_j, r, o^r_j⟩ is a single RDF triple with relation r ∈ R, the pipeline for one x^r is as follows:

Step 0: Set n = 0 and T^r_0 = T_I(x^r).

Step 1: Prompt L_n ( §3.1.1) with T^r_n to obtain the completion y^r_n.

Step 2: Use §3.1.2 to validate y^r_n against all conditions C. If errors (E^r_n) are found, run equation (5) and return to Step 1. Otherwise, go to Step 3.
Step 3: Use §3.2.1 to calculate F1(y^r_n) via eq. (3). If the calculated F1 score is lower than our chosen threshold 0 ≤ µ ≤ 1, continue to Step 4. Otherwise, output the current y^r_n as the final completion y^r.
Step 4: Use §3.2.2 to get the revised completion y^r_C.
Step 5: Compute the F1 scores of y^r_n and y^r_C using eq. (3) and take the completion with the higher score via eq. (6) to produce the final completion y^r.
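The steps above can be summarised in a short sketch. The interfaces are illustrative: `n_shot`, `parent_f1`, and `consistency_llm` stand in for §3.1, §3.2.1, and §3.2.2 respectively, and are not the paper's actual code:

```python
def aspiro(x, n_shot, parent_f1, consistency_llm, mu=0.7):
    """One pass of the ASPIRO pipeline for a single sample x (sketch)."""
    y = n_shot(x)                     # Steps 0-2: generate with retries
    if parent_f1(y) >= mu:            # Step 3: consistency gate
        return y
    y_c = consistency_llm(y)          # Step 4: revised completion
    # Step 5: keep whichever completion scores higher
    return y if parent_f1(y) >= parent_f1(y_c) else y_c
```

The design choice worth noting is the final comparison: because the Consistency LLM is not guaranteed to improve the completion, the pipeline keeps the original whenever its PARENT F1 is at least as high as the revision's.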

Experiments
The following sections show results on several setups of ASPIRO. In §4.1 we compare automatic metrics. ASPIRO outperforms Kasner et al. (2023)'s few-shot fine-tuned PLMs on all metrics and is competitive with the full-training-set fine-tuned full-rel2text model, with ca. 1-2 points lower BLEU but 28 percentage points higher BLEURT. This implies higher semantic similarity; however, the Semantic Similarity sub-score of NUBIA shows only small increments. Despite the overall NB score being the same for all ASPIRO setups, the sub-metrics of NUBIA show steady improvement between our models. The most noticeable change is in the Contradiction percentage, which the 5-shot setting improves by ca. 1.2 percentage points, with a further 2 percentage points gained by introducing the JSON prompt, suggesting a higher capacity to disambiguate the correct direction of the relation between subject and object entities in the input triples. The PARENT F1 score slightly favours the JSON-prompted setups of ASPIRO, but only by ca. 0.6 percentage points.
Additional experiments: For metric results and discussion on DART, see appendix §E.1. For full experiment results with fine-tuned pre-trained language models, refer to Kasner et al. (2023).

Parsing Errors
Parsing error analysis does not require specific references from the dataset. After ASPIRO produces the verbalisation templates (y^r), we run them through our Rule-based parser ( §3.1) to flag and count the number of errors. As source data (X), similar to Xiang et al. (2022), we collect at most 2 triple examples for each unique relation in the dataset and use them to prompt our pipeline.
Parsing error counts: For DART (Table 3) we use the full dataset ( §D.1), producing 4299 unique template sentences in each experiment run. In Rel2Text (Table 4) we only use the test split ( §D.3), with 226 unique relations and G3.5 (Table 2) as the base model, with either the (A)SDOT or (J)SON prompt and different N-shot Generator setups. For Rel2Text, we don't provide RR % as the reduction is evident from the counts.
Discussion: Introducing the N-shot Generator ( §3.1) shows a significant reduction in parsing error counts (Tables 3 and 4), even with N = 1. In the 1-retry-shot setting, GPT4 (G4) is most effective at reducing parsing errors. However, if we allow up to 5 retry shots, gpt-3.5-turbo (G3.5T) reduces parsing errors further. The exception is the (J)SON prompt on DART, where G4 keeps the lead. Interestingly, while text-davinci-003 (G3.5) performs well as a 0-shot model, it generally performs worse than G3.5T in N-shot settings, contrasted again on DART by the J prompt. It is also evident that the J prompt provides a more robust 0-shot baseline compared to the (A)SDOT prompt. The values in parentheses reveal that including Consistency Validation yields only a slight reduction in error count.

Ablation of Consistency Validator
To investigate the efficacy of the Consistency Validator, we conduct a brief ablation study on the Rel2Text test set ( §D.3). For statistical metrics (Table 5), CV provides only marginal gains. This effect may be attributed to the improvement of the Contradiction score and the degradation of the Neutrality score, implying that CV moves the templates closer to general statements with less informational value. Conversely, parsing errors (Table 6) are reduced notably by CV, with counts decreasing from 12 to 10 and from 23 to 16.

Conclusion
We proposed and evaluated ASPIRO, a general domain-agnostic pipeline for verbalisation of single-triple data entries into short template sentences, utilising rule-based re-prompting of LLMs. The pipeline comprises the N-shot Generator ( §3.1) and the Consistency Validator ( §3.2). We show that ASPIRO matches the automatic scores of fine-tuned pre-trained language models on the Rel2Text test set ( §4.1) and significantly reduces the parsing error count in 0-shot outputs of LLMs ( §4.2). The ablation study ( §4.3) revealed that ASPIRO's Consistency Validator further reduces error counts, but does not significantly affect automatic metrics.

Limitations
Operational costs: When contrasted with the 0-shot setting, ASPIRO significantly escalates operational costs (see appendix §F) due to the repeated calls of the N-shot Generator and the lengthy Consistency prompt ( §G.4) associated with the Consistency Validator ( §3.2). Following the brief ablation study of CV ( §4.3) and the cost analysis, it remains debatable whether the performance of the Consistency Validator reported in this paper justifies the additional expense incurred in prompting the LLM for the flagged examples.
Isolated triples: Generating verbalisations from single isolated triples doesn't account for situations where context from other triples is necessary to fully interpret the final natural language verbalisation.As exemplified by the DART dataset, contextual integration is significant and should be explored further.

Backup template:
In instances where the parsing of the <subject> and <object> within the generated completion of the LLM proved unsuccessful, Xiang et al. (2022) introduced a general backup template as a fallback. In our research, we did not use any backup templates and did not investigate their potential impact on automated metric scores. Nonetheless, it is important to acknowledge that within a production environment, the incorporation of a backup template is a fundamental necessity, warranting further assessment of its effects.

Direction of relation:
The capacity to accurately discern the correct direction of the relation between subject and object is a notable feature of data-to-text systems. In our experiments, we report the contradiction statistic (C %), which roughly measures this ability. Although ASPIRO generally improves on this statistic, there are no specific guardrails to validate the ambiguity other than the general knowledge of the LLM itself.
Variance of experiment runs: Due to the substantial expenses associated with prompting large language models (LLMs) and the considerable size of the DART dataset, each experiment on DART was conducted only once. The same is true for the Rel2Text parsing error analysis in Table 4. It should be noted that, although the temperature parameter was uniformly set to 0 for all the employed LLMs, the underlying generative process remains reliant on maximum likelihood estimation, which inherently leaves room for potential variation in our experimental results.

Ethics Statement
In the course of this research, we have employed various Generative Pre-trained Transformer models, including GPT3-davinci, InstructGPT text-davinci-003, gpt-3.5-turbo-0301, and gpt-4-0314, each demonstrating inherent biases as outlined in their respective publications, which are listed in Table 2. These biases often favour popular opinions and can lead to a distortion in the model's outputs. This reflects the models' training on large-scale internet text, which is not entirely neutral and contains biased or skewed perspectives. We acknowledge this limitation and highlight that despite implementing a pipeline designed to minimise the inclusion of unnecessary and irrelevant information, the potential for biased outcomes cannot be entirely eliminated.

C Metrics
For comparability of our automatic metric evaluations ( §4.1), we leverage most of the lexical and semantic similarity metrics used by Kasner et al. (2023). Below is a brief explanation of their significance.
BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) are metrics that quantify the lexical similarity between model-generated outputs and (typically human-produced) references by utilising n-gram overlap.
PARENT (Dhingra et al., 2019) additionally assesses the n-gram overlap of generated outputs with the source structured data (i.e., the table), which acts as an auxiliary reference beyond the reference text. This metric rewards instances where the hypothesis encapsulates all the information derived from the table, even if some elements are absent from the reference. Conversely, the PARENT score discriminates situations where the reference text contains supplementary information absent from the structured data, implying the integration of external knowledge or embellishments in the reference.
BLEURT (Sellam et al., 2020) is a trained metric that complements the above lexical similarity metrics by capturing semantic similarity.
NUBIA (Kane et al., 2020) is also a trained metric that combines multiple sub-metrics to assess the interchangeability or equivalence of two texts.
On the surface, this metric generates a single score (NB) that ranges between 0 and 1. Similar to Kasner et al. (2023), we also report the sub-metrics which are used for the total NB value:

SS: Semantic Similarity, on a scale of 0-5, where higher values suggest higher semantic similarity.

C%: Contradiction percentage, which increases as the output and reference contradict each other in their meaning.

N%: Neutral percentage (also referred to as "chance of irrelevancy"), which increases if the output contains new information or information which is irrelevant to the reference.

E%: Entailment percentage, which increases as the information in the reference is entailed by the model output.

E Additional Experiments

E.1 Automatic metrics for DART-SingleRDF
We used DART-SingleRDF ( §D.2) as the test set for automatic metric evaluation of ASPIRO. The full results are reported in Table 8 and the reduced results in Table 9.

Full: Results in Table 8 show slight or no discrepancies in the metrics between all the experiments, which could be attributed to variational error. Considering the high API model costs ( §F), we did not run DART experiments multiple times to provide deviations. Instead, we reduce the data to only problematic samples by taking the subset of generated templates y^r_0 = L_0(x^r) which satisfy PARENT F1(y^r_0) < 0.7. In other words, we take the subset of samples for which the outputs of the 0-shot model in the respective sub-table were flagged as inconsistent by the Consistency Validator ( §3.2) using µ = 0.7. We report the same metric evaluation process in Table 9.
Discussion: For the Reduced evaluation in Table 9, we found that ASPIRO shows significant improvement only when (A)SDOT is the initial prompt and only in the 5-shot gpt-3.5-turbo setting. A point of interest is also the Neutral (irrelevance) score (N %), which the 5-shot setting generally increases, suggesting that the N-shot setting reduces the relevance of generated verbalisations to the references. For JSON prompts, the 1-shot gpt-4 setting has a slight, albeit marginal, lead over the other settings.

E.2 Performance on WebNLG
We additionally evaluated performance on WebNLG ( §D.4), following a similar approach as with Rel2Text in the main experiments ( §4). GPT-3.5-turbo-0301 is used as the LLM instance for all calls to both the N-shot generator LLM stack and the Consistency Validator.
Parsing errors: We observed (Table 10) that for WebNLG, ASPIRO is generally not able to fix any errors, and CV conversely increases the number of total errors, making 3 of the templates more "flawed" than without CV.

Automatic metrics:
We compare the templates generated by ASPIRO to the manually crafted templates from Kasner and Dusek (2022) to evaluate lexical similarity using PARENT, BLEU and METEOR. The results, seen in Table 11, are marginal at best: we only observe improvement in the BLEU score, while PARENT F1 and METEOR are highest for the zero-shot setting. Due to time constraints, we did not include BLEURT and NUBIA evaluations.
Conclusion: Contrary to our original belief, we conclude that the ASPIRO pipeline does not provide significant improvement over the 0-shot method on the WebNLG dataset.

F Run time and cost of ASPIRO
The overall expenditure for OpenAI API calls during our experiments was US$191. However, it is important to note that we made many redundant API calls during the development of ASPIRO, so the necessary costs should be lower. The main bulk of the costs amounts to GPT3-davinci and text-davinci-003 calls.

F.1 Run time analysis
Table 12 presents the average run time in seconds across five experimental runs using the WebNLG dataset, which comprises 354 unique relations. This translates to a total of 354 calls required for the 0x model, i.e., a zero-shot call to the initial model (gpt-3.5-turbo) for each relation. Subsequent retry shots only need as many calls as there are templates with parsing errors.
Cumulative mean time: Given the nature of our experiments, where subsequent runs leverage results from preceding runs (for instance, the 2-shot run utilises results from the 1-shot run and only re-prompts those with parsing errors), we introduce Cumulative mean time to illustrate the total time necessary to execute all shots of the respective experiment.

F.2 Estimated API call costs
For a worst-case-scenario cost estimate of ASPIRO (all templates are tagged for a retry shot), we made calculations for the GPT-3.5-turbo model, which charges $0.002 per 1000 tokens (as of the time of our experiments). Table 13 provides cost estimations for the experiments conducted on the Rel2Text, WebNLG, and DART datasets using GPT-3.5-turbo. To derive the costs associated with the GPT3-davinci or text-davinci-003 models (charged at $0.02 per 1000 tokens), multiply the values in Table 13 by a factor of 10.
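The scaling between the two price tiers is simple linear arithmetic, sketched below with the prices quoted above (the function name and token counts are illustrative):

```python
def api_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Estimate API cost for a token count at a per-1000-token price."""
    return total_tokens / 1000 * usd_per_1k_tokens

# gpt-3.5-turbo at $0.002/1k tokens vs davinci-class models at $0.02/1k:
# the same workload costs 10x more on the davinci tier.
turbo = api_cost_usd(500_000, 0.002)    # $1.00 for half a million tokens
davinci = api_cost_usd(500_000, 0.02)   # $10.00 for the same workload
```

This is why the davinci-tier columns can be derived from Table 13 by a simple factor of 10.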

Figure 2: Example flow of the ASPIRO pipeline with input sample x^r = [⟨Mario, creator, Shigeru Miyamoto⟩]. The Rule-based parser ( §3.1.2) is a purely algorithmic module which validates y_n against a set of conditions {C} one by one.

Figure 1: ASPIRO pipeline for general input sample x^r ∈ X.


Table 2: LLM variants used in our experiments.

In §4.1 we compare ASPIRO on the Rel2Text test set ( §D.3) with Kasner et al. (2023)'s fine-tuned BART-BASE models. In §4.2 we report on the number of parsing errors tagged by our Rule-based parser ( §3.1) on both the DART ( §D.1) and Rel2Text ( §D.3) datasets. In §4.3 we also provide a brief ablation study of CV.

Setup: For the N-shot Generator ( §3.1), L_0 marks the initial model choice and N×L_n marks up to N retry shots using model L_n. We limit our experiments to L_n being the same for all N shots. For the Consistency Validator ( §3.2), we set µ = 0.7 and only use it in some setups (marked by L_C in brackets). For reference on the LLMs used as L in ASPIRO setups, see Table 2.

Prompts: While the Retry prompt T_R ( §G.3) and the Consistency prompt T_C ( §G.4) are constant across all our experiments, we compare two variants of the Initial prompt T_I: (A)SDOT, proposed by Xiang et al. (2022), and our (J)SON prompt ( §G.2). We evaluate automatic metrics on the Rel2Text test set ( §D.3) with 4 ASPIRO setups (see Table 1 for 5-run averages; Table 7 for standard deviations).

Table 3: DART dataset counts of templates with parsing errors. RR %: error Rate Reduction percentage of the best N-shot setup (bold) vs the 0-shot model.

Table 4: Rel2Text counts of templates with errors.

Table 6: Individual error counts (|E|) without and with CV on the Rel2Text test set with the (A)SDOT prompt.

Table 10: Number of errors tagged in generated templates by our Rule-based parser ( §3.1.2) for different experiment setups with gpt-3.5-turbo-0301 on the full Enriched WebNLG v1.4 dataset. All models (base, retry and CV) are instances of gpt-3.5-turbo-0301. The "Multiple SUBJECTs" column was omitted (0 for all experiments).