The Role of Output Vocabulary in T2T LMs for SPARQL Semantic Parsing

In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are predominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.


Introduction
Knowledge Graph Question Answering (KGQA) is the task of finding answers to questions posed in natural language, using triples present in a KG. Typically the following steps are followed in KGQA: 1) Objects of interest in the natural language question are detected and linked to the KG in a step called entity linking. 2) The relation between the objects is discovered and linked to the KG in a step called relation linking. 3) A formal query, usually SPARQL (https://www.w3.org/TR/rdf-sparql-query/), is formed with the linked entities and relations. The query is executed on the KG to fetch the answer.
Our focus in this work is the query building phase, henceforth referred to as KGQA semantic parsing. The motivation of our work stems from Banerjee et al. (2022), where minor vocabulary substitutions to handle non-printable special characters for T5 (Raffel et al., 2020) produced better results on the task of SPARQL semantic parsing. In this work, we extend the idea and replace the entire SPARQL vocabulary with alternate vocabularies.
† The authors contributed equally to this work.
As in Banerjee et al. (2022), we replace certain special characters in the SPARQL vocabulary, such as { and }, with textual identifiers, as T5 is known to have problems dealing with these special characters (Banerjee et al., 2022). We call the result a masked query, and in this work, we test the ability of the models to generate this masked query, given the natural language question as input.
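The masking step can be illustrated with a minimal sketch. The function name `mask_query`, its arguments, and the plain string replacement are assumptions made for illustration and are not taken from the released code; only the placeholder names (OB, CB, ent0, rel0) follow the example used in this paper.

```python
import re

# Textual identifiers for curly braces, following the example in the paper.
BRACES = {"{": " OB ", "}": " CB "}

def mask_query(sparql, entities, relations):
    """Hypothetical helper: replace linked entities and relations with
    ent<i> / rel<i> placeholders and curly braces with OB / CB."""
    masked = sparql
    # Naive string replacement; an identifier that is a prefix of another
    # one would need more careful handling in practice.
    for i, entity in enumerate(entities):
        masked = masked.replace(entity, f"ent{i}")
    for i, relation in enumerate(relations):
        masked = masked.replace(relation, f"rel{i}")
    for char, name in BRACES.items():
        masked = masked.replace(char, name)
    # Put spaces around parentheses, then collapse runs of whitespace.
    masked = re.sub(r"([()])", r" \1 ", masked)
    return " ".join(masked.split())

query = "ASK WHERE { wd:Q2084454 wdt:P5066 ?obj filter(?obj = 22.4) }"
masked = mask_query(query, ["wd:Q2084454"], ["wdt:P5066"])
print(masked)
```

Keeping the entity and relation identifiers in separate lists makes it straightforward to substitute them back into a generated query after decoding.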
A sample question, the original SPARQL query, and the corresponding masked query are shown below (for the Wikidata KG (Vrandečić and Krötzsch, 2014)):
Question: Is it true that an Olympic-size swimming pool's operating temperature is equal to 22.4?
SPARQL query: ASK WHERE { wd:Q2084454 wdt:P5066 ?obj filter(?obj = 22.4) }
Masked query: ASK WHERE OB ent0 rel0 ?obj filter ( ?obj = 22.4 ) CB
In the era of pre-trained Language Models (LMs) (Devlin et al., 2019; Raffel et al., 2020) it is common practice to fine-tune models on custom downstream datasets. This requires supervised training, which modifies the weights of the models using some training algorithm. More recently, the technique of prompting language models (Brown et al., 2020; Shin et al., 2020) has been developed, which elicits the desired response from an LM through a task description and a few input-output examples. Brown et al. (2020) show that such a strategy works better for larger models. It has however been observed that prompt design is brittle and sensitive to the exact phrasing (Shin et al., 2020). A more recent innovation is that of prompt tuning (Lester et al., 2021), where the task-specific prompt is learnt on a smaller external neural network. The gradients are computed and flow through the LM, but leave the weights of the LM itself unchanged. Instead, the weights of the prompt tuning network change and produce a custom and continuous prompt which elicits the desired response from the LM.
A similar method is prefix tuning (Li and Liang, 2021), which is known to perform better for generation tasks (Ma et al., 2022). In this method, the original inputs and outputs are kept the same, but the input is prepended with a continuous prefix learnt in the external network. This prefix allows the model to understand the exact task to be performed by it.
As our primary contribution, in this work, we perform an analysis of how the complexity of output vocabularies affects the performance on the KGQA semantic parsing task for prefix-tuned and fine-tuned language models. Code and data can be found at https://github.com/debayan/sparql-vocab-substitution.

Related Work
A study of low-resource semantic parsing using prompt tuning was performed by Schucher et al. (2022) on the TOP v2 (Chen et al., 2020) and Overnight (Wang et al., 2015) datasets. Prompt tuning, while not the same as prefix tuning, still keeps the LM weights frozen while the prompts are learnt on an external network. In their experiments, they perform a single kind of vocabulary substitution but find no noticeable performance improvements. No specific study is made of the change in performance with vocabularies of varying complexities, which is a task we undertake. Another difference is that we perform experiments in the high-resource use case as opposed to low-resource.
Another work which is similar to ours is Sun et al. (2022), where the authors experiment with prefix tuning on the task of semantic parsing, and find problems with non-standard vocabularies of logical forms. In their case, they work with the TOP v2 (Chen et al., 2020) and PIZZA (Arkoudas et al., 2022) datasets. The keywords in those datasets consist of words joined by underscores (e.g., IN:GET_REMINDER_DATA_TIME), which poses a problem for the sub-word tokenizer of transformer-based models. They find that fine-tuning a model on these datasets outperforms prefix tuning by a large margin. However, when they add the non-standard keywords to the tokenizer vocabulary and re-train the tokenizer to generate new embeddings for these keywords, fine-tuning and prefix tuning perform on par. Our work is different in a few respects: firstly, due to the specific research focus of our group, we experiment with a semantic parsing dataset for KGQA, namely GrailQA (Gu et al., 2021). Secondly, instead of retraining the tokenizer, we perform a simpler procedure of pre-processing the dataset by replacing the current vocabulary with a new vocabulary. We then train the models on this modified dataset, and as a post-processing step, substitute back the original vocabulary in place of the new vocabulary.

Prefix Tuning
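Prefix tuning prepends a set of tunable weights to every key-value pair in the transformer attention. The transformer attention is represented as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$$
where the query $Q$, key $K$ and value $V$ are obtained through affine transformations on the input, and $d$ represents the model dimension. Prefix tuning modifies the transformer attention by adding tunable prefixes to $K$ and $V$, thereby modifying $K$ as $K' = [h_K; K]$ and $V$ as $V' = [h_V; V]$. Here $h_K$ and $h_V$ represent the key prefix and the value prefix respectively. Following Li and Liang (2021) we model these prefixes using a two layer MLP as follows:
$$h_K = f(E W_{K,1} + b_{K,1})\, W_{K,2} + b_{K,2}, \qquad h_V = f(E W_{V,1} + b_{V,1})\, W_{V,2} + b_{V,2}$$
where $W \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^{d}$ are trainable weights and biases respectively, and $f$ denotes the activation function of the MLP. $E \in \mathbb{R}^{C \times d}$ is a trainable embedding matrix with $C$ as the prefix length.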
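To make the effect of the prefixes concrete, the following is a minimal single-head sketch in plain Python. It only illustrates how prefix rows are prepended to the keys and values before scaled dot-product attention is applied; the function names are hypothetical, the prefixes are passed in directly rather than produced by the MLP, and it is not the OpenPrompt implementation used in our experiments.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of d-dimensional row vectors."""
    d = len(Q[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output

def prefix_attention(Q, K, V, h_K, h_V):
    """Prepend the prefix rows to K and V, then attend as usual."""
    return attention(Q, h_K + K, h_V + V)

# One query attends over one real position plus one prefix position.
out = prefix_attention(Q=[[0.0, 0.0]], K=[[1.0, 0.0]], V=[[3.0, 3.0]],
                       h_K=[[0.0, 1.0]], h_V=[[1.0, 1.0]])
print(out)
```

The queries are left untouched, so the output keeps one row per input position while every position can additionally attend to the C learnt prefix rows.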

Models and Experimental Setup
We carry out prefix-tuning and fine-tuning experiments with two versions of the T5 model, namely T5-Small (60 million parameters) and T5-Base (220 million parameters). Questions are fed as input during training while masked SPARQL queries, as described in Section 1, are provided as labels for supervision. For evaluation, we use the exact-match metric. A generated query is matched token by token to the gold query, while ignoring whitespace. The percentage of queries matched is reported.
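A minimal sketch of the exact-match metric is given below. It assumes that "ignoring whitespace" means that runs of whitespace between tokens are collapsed before comparison; the function names are hypothetical and the released evaluation script may differ in detail.

```python
def tokens(query):
    """Split a query on whitespace so that spacing differences are ignored."""
    return query.split()

def exact_match(predictions, golds):
    """Percentage of predictions whose token sequence equals the gold one."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    hits = sum(tokens(p) == tokens(g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)

score = exact_match(
    ["ASK  WHERE OB ent0 rel0 ?obj CB", "SELECT ?x WHERE OB ent0 rel0 ?x CB"],
    ["ASK WHERE OB ent0 rel0 ?obj CB", "SELECT ?x WHERE OB ent0 rel1 ?x CB"],
)
print(score)
```

Under this metric a single wrong placeholder, such as rel0 in place of rel1, makes the whole query count as incorrect.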

Hyper-parameters and Implementation Details
Throughout our experiments, the prefix length is fixed to 50. For prefix tuning experiments we use the Adafactor (Shazeer and Stern, 2018) optimizer with a constant learning rate of 0.001. Fine-tuning experiments are optimized through AdamW (Loshchilov and Hutter, 2019) with a square root decay schedule, a maximum learning rate of 0.0015 and a linear warm-up of 5000 steps. Our code is implemented with HuggingFace Transformers (Wolf et al., 2020) and OpenPrompt (Ding et al., 2022). T5-Small experiments were run on 12GB Nvidia GTX-1080 and RTX-2080 GPUs, and T5-Base experiments were run on 48GB Nvidia RTX-A6000 GPUs. For fine-tuning, we run each training three times with separate seeds for 120 epochs each. For prefix tuning we do the same for 400 epochs. We report the inference results of these trained models on the test sets of the respective datasets.
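The fine-tuning schedule can be sketched as follows. The exact functional form is not stated above, so the inverse square root decay after warm-up used here is an assumption; only the maximum learning rate and the number of warm-up steps are taken from the text.

```python
import math

MAX_LR = 0.0015        # maximum learning rate, as stated above
WARMUP_STEPS = 5000    # linear warm-up length, as stated above

def learning_rate(step):
    """Linear warm-up to MAX_LR, then an assumed inverse square root decay."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    return MAX_LR * math.sqrt(WARMUP_STEPS / step)

for step in (0, 2500, 5000, 20000):
    print(step, learning_rate(step))
```

With this form the learning rate peaks at the end of warm-up and halves every time the step count quadruples.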

Vocabulary
The original vocabulary of the GrailQA dataset consists of 48 words. The T5 tokenizer splits these words into 124 sub-words. This tokenizer specific vocabulary size (TSVS) is seen in the last column of Table 1. In the next column, the original average logical form (SPARQL query) length can be seen as 125 tokenized sub-words.
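The two statistics can be computed as in the sketch below. It assumes that TSVS counts the distinct sub-words produced by tokenizing the vocabulary, which is one reading of the definition above; the helper names are hypothetical, and `toy_tokenize` is only a stand-in for the T5 tokenizer so that the example runs without external dependencies.

```python
def tsvs(vocabulary, tokenize):
    """Tokenizer specific vocabulary size: distinct sub-words covering the vocabulary."""
    return len({piece for word in vocabulary for piece in tokenize(word)})

def alfl(queries, tokenize):
    """Average logical form length, measured in tokenized sub-words."""
    return sum(len(tokenize(q)) for q in queries) / len(queries)

def toy_tokenize(text):
    # Stand-in tokenizer (NOT the T5 tokenizer): split on whitespace, then
    # cut every word into chunks of at most three characters.
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]

print(tsvs(["SELECT", "WHERE", "FILTER"], toy_tokenize))
print(alfl(["SELECT WHERE", "FILTER"], toy_tokenize))
```

To reproduce the figures in Table 1 one would pass the `tokenize` method of the T5 tokenizer loaded through HuggingFace Transformers in place of `toy_tokenize`.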
We wish to see how a new output vocabulary affects performance, and as a result, we construct a set of special vocabularies and substitute them in place of the original SPARQL vocabulary. With reference to the settings in Table 1, each vocabulary is as described below:
original: The masked SPARQL queries remain as they are. No replacement of the original SPARQL keywords is made with an alternate vocabulary.
dictionary: The SPARQL keywords are replaced with a vocabulary of English words. For example, SELECT may be replaced with DOG, [ may be replaced with CAT, etc. During the pre-training phase an LM is likely to have seen such words far more frequently than the SPARQL keywords. This mode tests how the model behaves when the output vocabulary consists of well-known English words.
char1: The SPARQL keywords are replaced with a single character of the English alphabet, for example, SELECT is replaced with A, WHERE is replaced with B. Additionally, numerical digits from 1-9 are used, and if the size of the vocabulary demands more, we add single-character special symbols, such as * and $.
char2, char4 and char8: These settings apply vocabulary substitutions of 2, 4 and 8 character lengths chosen randomly, constituted from the characters A-Z and digits 0-9. For example, a typical char8 substitution would be SELECT replaced by ATYZGFSD. This setting is designed to test the behaviour of the models when asked to produce a larger number of tokens per original-vocabulary word.
A sample of a question, the SPARQL query and the corresponding substitutions is provided in the Appendix in Table 2.
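The char2, char4 and char8 substitutions, together with the post-processing step that restores the original vocabulary, can be sketched as follows. The function names, the fixed seed and the whitespace-based token matching are assumptions made for illustration and may differ from the released code.

```python
import random
import string

ALPHABET = string.ascii_uppercase + string.digits  # A-Z and 0-9

def build_char_vocab(keywords, length, seed=0):
    """Map every keyword to a distinct random code of `length` characters."""
    if len(keywords) > len(ALPHABET) ** length:
        raise ValueError("not enough distinct codes of this length")
    rng = random.Random(seed)
    mapping, used = {}, set()
    for keyword in keywords:
        code = "".join(rng.choice(ALPHABET) for _ in range(length))
        while code in used:  # redraw on collision with an earlier code
            code = "".join(rng.choice(ALPHABET) for _ in range(length))
        mapping[keyword] = code
        used.add(code)
    return mapping

def substitute(query, mapping):
    """Pre-processing: replace keywords by their codes, token by token."""
    return " ".join(mapping.get(tok, tok) for tok in query.split())

def restore(query, mapping):
    """Post-processing: map codes in a generated query back to keywords.
    Assumes no code coincides with a literal that occurs in the query."""
    inverse = {code: keyword for keyword, code in mapping.items()}
    return " ".join(inverse.get(tok, tok) for tok in query.split())

keywords = ["ASK", "WHERE", "OB", "CB", "filter", "(", ")", "="]
mapping = build_char_vocab(keywords, length=4)
masked = "ASK WHERE OB ent0 rel0 ?obj filter ( ?obj = 22.4 ) CB"
substituted = substitute(masked, mapping)
print(substituted)
print(restore(substituted, mapping))
```

Because the mapping is one-to-one, restoring a correctly generated query recovers the masked SPARQL query exactly, so the exact-match evaluation is unaffected by the choice of vocabulary.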

Datasets
For our experiments, we require a dataset which contains a mapping of natural language questions to their corresponding logical forms and is large in size, since we test the high-resource use case.
GrailQA is based on the Freebase knowledge graph (Bollacker et al., 2008) and consists of 64,331 questions designed to test three levels of generalisation, i.e., i.i.d., compositional and zero-shot. For our purposes, we split the train set itself into three parts, since we are not interested in testing the compositional generalisation aspects of the test set of this dataset. We are left with the following configuration: test: 8868, dev: 4434, train: 31035.
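The split sizes above add up to 44,337 examples. A minimal sketch of producing such a split is shown below; the random shuffle, the seed and the function name are assumptions, since only the resulting sizes are stated.

```python
import random

def split_train_set(examples, n_test=8868, n_dev=4434, seed=0):
    """Shuffle the examples and cut them into train, dev and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Integer ids stand in for the 44,337 question-query pairs.
train, dev, test = split_train_set(range(44337))
print(len(train), len(dev), len(test))
```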

Analysis
As seen in Table 1, the best performance for prefix and fine tuning is achieved for substituted vocabularies. The original vocabulary lags behind in general, which indicates that the choice of an appropriate vocabulary improves performance for semantic parsing. Further, among the substituted vocabularies, the setting char8 performs the worst, which signifies the adverse role of the extra decoding load of this vocabulary on the performance of the model.
This finding is different from that of Schucher et al. (2022), who find their in-vocab setting performing no better overall. They attribute it to the substitutions possibly masking the meanings of the intents, for their given dataset. On the contrary, we find significant gains for GrailQA. It must be noted, however, that we perform high-resource prefix tuning while they perform low-resource prompt tuning, and hence results may differ.
As seen in Figure 1, for the char settings, as the size of the vocabulary increases, the prefix tuning accuracy drops. In the figure, we define the vocabulary compression ratio as the size of the new vocabulary divided by the size of the original vocabulary. Apart from vocabulary size, the query length also matters. We therefore define a second compression ratio, the query length after substitution of the new vocabulary divided by the original query length, and plot it on the same graph.
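Both ratios are simple quotients, as the sketch below shows. The original TSVS of 124 and ALFL of 125 are the values reported in Section 5; the substituted values in the example are hypothetical and chosen only to illustrate the arithmetic, they are not results from Table 1.

```python
ORIGINAL_TSVS = 124  # tokenized vocabulary size of the original setting
ORIGINAL_ALFL = 125  # average logical form length of the original setting

def compression_ratio(new_size, original_size):
    """Size after substitution divided by the size before substitution."""
    if original_size <= 0:
        raise ValueError("original size must be positive")
    return new_size / original_size

# Hypothetical substituted vocabulary: 62 sub-words, 100 sub-words per query.
vocab_ratio = compression_ratio(62, ORIGINAL_TSVS)
length_ratio = compression_ratio(100, ORIGINAL_ALFL)
print(vocab_ratio, length_ratio)
```

A ratio below 1 means the substituted vocabulary or query is more compact than the original, and a ratio above 1 means it is larger.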
When compared to the fine-tuning plot (Figure 2), prefix tuning has a steeper drop in accuracy, and the performance of T5-Small and T5-Base differs more significantly. This leads to the finding that fine-tuning is less sensitive to vocabulary changes, and the difference in model sizes between T5-Small and T5-Base also seems to matter less.
In Figures 1 and 2, it can be seen that the original setting for the masked SPARQL vocabularies produces accuracies which are below the char family vocabulary curves. This suggests that the vocabulary compression ratio alone is not a deciding factor in accuracy. If the vocabulary family changes from SPARQL to characters, there is an initial shift in accuracy, and after that the complexity of the character vocabulary further affects the accuracy.
In Table 1, the dictionary setting performs slightly worse than the char1 setting, although it has lower TSVS and ALFL. This suggests that the vocabulary size and query length are not the only factors that affect the eventual accuracy. Perhaps the frequency of the tokens seen by the model during the pre-training task plays a role. It is likely that the model has encountered, during pre-training, single characters a far larger number of times than the words used in the dictionary vocabulary.

Error Analysis
We performed an error analysis on a sample of 100 randomly selected questions which produced an incorrect output. In the original setting, roughly 50% of the errors were due to the presence of non-printable characters in the query (e.g., ^). We found that in the initial masked query, while we had replaced some non-printable characters in the pre-processing stage (e.g., { and }), we had not managed to replace the full set of non-printable characters. The original T5 paper mentions curly braces as one of the classes of tokens that are not present in the pre-training corpus; however, a comprehensive list of the tokens that do not work with T5, or work with limited efficiency, is not available. In this scenario, it seems that a better approach is to replace the entire vocabulary with one that is entirely known to T5, for example, English words. When comparing errors made by original that were fixed by dictionary and char1, we observed that roughly 30% of the cases were of variable placement, where variable placeholders like ent0 and rel0 were found to be in the wrong order in the output query in the original setting. The rest of the corrections belonged to the category of syntax errors. This points to the finding that alternate vocabularies improve the ability of T5 to correctly produce logical forms from a semantic perspective.
To analyse the effect of increasing complexity of vocabulary, we compare 100 randomly selected errors made by char8 with the char2 outputs for the same questions. In both these settings, no character is non-printable, and the only errors are either syntax errors, variable placement errors, structural errors or intent errors. Out of the 100 questions, 90 were found to be correct in the char2 setting. For these 90 questions, the highest proportion of errors in the char8 setting belonged to syntax (where the query is malformed). The next most prominent class of errors belonged to variable placement, followed by structural errors (e.g., two triples instead of three). The major takeaway from this analysis is that for char2 there were no syntax errors, while in char8 there are a significant number of such errors.

Conclusion
In this work we carried out experiments with new output vocabularies, where we carefully substituted the original members of the vocabulary with new ones. We found that when the original SPARQL vocabulary is replaced with words from an alternate vocabulary closer to the T5 tokenizer vocabulary, the model consistently performs better.
As a contribution, we believe that our findings will enable researchers in the field of semantic parsing to deploy smaller models with a modified vocabulary and still find satisfactory performance. This would, in the longer term, lead to energy savings.
As future work, we would like to explore the behaviour of the same models in more depth using attention maps. Moreover, the significant shift in initial performance on changing the vocabulary from original to char and dictionary demands further investigation. Similarly, the relatively lower performance of the dictionary setting when compared to the char1 setting, in spite of its lower tokenized vocabulary size (TSVS), needs to be investigated further. Perhaps sub-words which are seen more frequently during the pre-training task of the LM perform better when substituted into the semantic parsing output vocabulary.

Limitations
We found that prefix tuning takes much longer to converge when compared to fine tuning, and for T5-Base, it takes around 10 days on a 48 GB GPU to complete tuning for a single setting in Table 1. Due to limitation of resources and with an aim to save energy, we did not conduct experiments with larger models such as T5-Large, T5-XL, etc. We also did not perform experiments with smaller splits of the same datasets, which could have given further insights on how model performance varies when less training data is available.

Figure 1: Prefix tuning accuracy drops as vocabulary and query lengths increase for char settings. TSVS = Tokenizer specific vocabulary size, ALFL = Average logical form length


Table 1: Exact match percentages for generated masked SPARQL queries. Best performance is always found in substituted vocabularies. For char settings, accuracy drops as vocabulary and query lengths increase. TSVS = Tokenizer specific vocabulary size, ALFL = Average logical form length, PT = Prefix Tuning, FT = Fine Tuning

Table 2: An example of a question from GrailQA, with the corresponding SPARQL query, and how they look once new vocabularies are substituted.