SPARQLing Database Queries from Intermediate Question Decompositions

To translate natural language questions into executable database queries, most approaches rely on a fully annotated training set. Annotating a large dataset with queries is difficult as it requires query-language expertise. We reduce this burden using intermediate question representations grounded in databases. These representations are simpler to collect and were originally crowdsourced within the Break dataset (Wolfson et al., 2020). Our pipeline consists of two parts: a neural semantic parser that converts natural language questions into the intermediate representations, and a non-trainable transpiler to the SPARQL query language (a standard language for accessing knowledge graphs and the semantic web). We chose SPARQL because its queries are structurally closer to our intermediate representations than SQL queries are. We observe that the execution accuracy of queries constructed by our model on the challenging Spider dataset is comparable with that of state-of-the-art text-to-SQL methods trained with annotated SQL queries. Our code and data are publicly available (https://github.com/yandex-research/sparqling-queries).


Introduction
The difficulty of collecting and annotating datasets for the task of translating a natural language question into an executable database query is a significant obstacle to progress in this area. The most popular multi-database text-to-SQL dataset, Spider (Yu et al., 2018), has 10K questions, which is small compared to question answering datasets of other types: the DROP dataset with text paragraphs has 97K questions (Dua et al., 2019), and the GQA dataset with images has 22M questions (Hudson and Manning, 2019). The Spider dataset was created by 11 Yale students proficient in SQL, and it is difficult to scale such a process up.
Recently, Wolfson et al. (2020) proposed the Question Decomposition Meaning Representation, QDMR, which is a way to decompose a question into a list of "atomic" steps representing an algorithm for answering the question. Importantly, they developed a crowdsourcing pipeline to annotate QDMRs and showed that it can be used at scale: they collected 83K QDMRs for questions (all in English) coming from different datasets (including Spider) and released them in the Break dataset.
QDMRs resemble database queries but are not connected to any execution engine and cannot be run directly. Moreover, QDMRs were collected by looking only at the questions and thus carry no information about the database structure. Entities mentioned in QDMR steps usually have counterparts in the corresponding database but are not linked to them (i.e., they lack groundings).
In this paper, we build a system for translating a natural language question first into QDMR and then into an executable query. We use modified QDMRs, in which the entities described with text are replaced with their database groundings. Our system consists of two translators: a neural network for text-to-QDMR and a non-trainable QDMR-to-SPARQL transpiler. See Figure 1 for an illustration of our system.
In the text-to-QDMR part, we use an encoder-decoder model. Our encoder is inspired by the relation-aware transformer of RAT-SQL (Wang et al., 2020) and uses BERT (Devlin et al., 2019) or GraPPa (Yu et al., 2021). Our decoder is a syntax-guided network (Yin and Neubig, 2017) designed for our version of the QDMR grammar. We trained this model with full supervision, for which we automatically grounded QDMRs for a subset of Spider questions.
In the second part of the system, our goal was to translate grounded QDMRs into one of the existing query languages to benefit from the efficiency of database software. The most natural choice would be to use SQL, but designing such a translator was difficult due to structural differences between QDMR and SQL. Instead, we implement a translator from QDMR to SPARQL, which is a query language for databases in the Resource Description Framework (RDF) format (Prud'hommeaux and Seaborne, 2008; Harris and Seaborne, 2013). SPARQL is a standard made by the World Wide Web Consortium and is recognized as one of the key technologies of the semantic web. See Figure 2 for an example of an RDF database.

Figure 2: Database from Figure 1 converted to the RDF format (the RDF graph). The red nodes correspond to the values from the teacher table, the green ones to the values from the school table. Arcs correspond to the relations between the primary key and other values of the same row (arc:tbl:col) and along the foreign keys (arc:t_src:c_src:t_tgt:c_tgt).

We evaluated our system with the execution accuracy metric on the Spider dataset (splits by Wolfson et al., 2020) and compared it with two strong baselines: the text-to-SQL systems BRIDGE (Lin et al., 2020) and SmBoP (Rubin and Berant, 2021) from the top of the Spider leaderboard. On the cleaned-up validation set, our system outperforms both baselines. On the test set with the original annotation, our system is in between the baselines. Additionally, we experimented with training our models on extra data: items from Break without schemas but with QDMRs. This teaser experiment showed potential for further improvements.
This paper is organized as follows. Sections 2 and 3 present the two main parts of our system. Section 4 contains the experimental setup, and Section 5 our results. We review related work in Section 6 and conclude in Section 7.
QDMR-to-SPARQL translator

QDMR logical forms
Question Decomposition Meaning Representation (QDMR), introduced by Wolfson et al. (2020), is an intermediate format between a question in a natural language (tested in English) and an executable query in some formal query language. QDMR is a sequence of steps, and each step corresponds to a separate logical unit of the question (see Table 1). A QDMR step can refer to one of the previous steps, allowing one to organize the steps into a graph.
We work with QDMR logical forms (LFs), which can be automatically obtained from the text-based QDMRs, e.g., with the rule-based method of Wolfson et al. (2020). Steps of a logical form are derived from the corresponding steps of the QDMR. Each step of an LF includes an operator and its arguments. We show some operators in Table 2.

Grounding QDMRs in databases
QDMR logical forms resemble programmed queries but are not connected to any execution engine and cannot be executed directly. To execute these LFs using knowledge from a database, one needs to associate their arguments with the entities of the database: tables, columns, and values. We refer to this association as grounding and provide the details below.
Arguments of LF operators can be of different types (see Table 2 and Appendix A) and some types require groundings. Type ref indicates a reference to one of the existing LF steps. Type text corresponds to a text argument that needs to be grounded to a table, column or value in the database. Type choice corresponds to the choice among a closed list of possible options, and type bool corresponds to the True/False choice.
There are also a few edge cases that require special processing. First, the value argument of the COMPARATIVE operator can be either ref or text. Second, the operator argument of AGGREGATE/GROUP can actually be grounded to a column. We introduced this exception because a database can contain only aggregated information, without information about individual instances. As the QDMR annotation is built without looking at the database, it cannot distinguish the two cases. In the example of Table 1, if the database had a column num_teachers in the table school, we would need to ground count to the column num_teachers.
We describe our procedure for annotating LF arguments with groundings in Section 4.2.

Executable queries in SPARQL
To convert a QDMR LF with grounded step arguments into an actually executable query, it is beneficial to translate QDMR into one of the existing query languages so as to use an existing efficient implementation at test time. In this paper, we translate QDMR queries into SPARQL, a language for querying databases in the graph-based RDF format (Prud'hommeaux and Seaborne, 2008; Harris and Seaborne, 2013). Next, we briefly overview the RDF database format and SPARQL and then describe our algorithm for translating grounded LFs into SPARQL queries.
RDF format. In RDF, data is stored as a directed multi-graph, where the nodes correspond to the data elements and the arcs correspond to relations. RDF-graphs are usually defined by sets of subject-predicate-object triples, where each triple defines an arc: the subject is the source node, the predicate is the type of relation and the object is the target node.
Relational data to RDF. To evaluate our approach on the Spider dataset containing relational databases (in the SQLite format), we convert the relational databases to the RDF format. The conversion is inspired by the specification of Arenas et al. (2012). For each table row of the relational database, we add to the RDF graph a set of triples corresponding to each column. For the primary key column key of a table tbl, we create a triple with the self-link arc:tbl:key pointing from the key element to itself. For any other column col in the table tbl, we create a triple with the separate edge type arc:tbl:col, which connects the primary key element of a row to the corresponding element in col. For each foreign key of the database, we create an arc type arc:t_src:c_src:t_tgt:c_tgt (here the target column c_tgt has to be a key). Then we add to the RDF graph the triples with these foreign-key relations. See Figure 2 for an example of the RDF graph for the database of Figure 1.
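The conversion described above can be sketched in a few lines of code, representing RDF triples as plain Python tuples. The table, columns, and values below are toy examples, not the paper's data:

```python
# Sketch of the relational-to-RDF conversion: for the primary-key column we
# emit a self-link arc:tbl:key; for every other column col we emit an
# arc:tbl:col arc from the row's key value to the cell value.

def table_to_triples(table_name, key_col, rows):
    """Convert rows of a relational table into (subject, predicate, object) triples."""
    triples = set()
    for row in rows:
        key = f"{table_name}:{row[key_col]}"
        # self-link for the primary key
        triples.add((key, f"arc:{table_name}:{key_col}", key))
        for col, value in row.items():
            if col != key_col:
                # link the primary-key node of the row to the cell value
                triples.add((key, f"arc:{table_name}:{col}", value))
    return triples

school_rows = [
    {"id": 1, "State": "MI"},
    {"id": 2, "State": "OH"},
]
triples = table_to_triples("school", "id", school_rows)
# contains ("school:1", "arc:school:id", "school:1") -- the key self-link --
# and ("school:1", "arc:school:State", "MI") -- a key-to-column arc.
```

Foreign-key arcs would be added analogously, with the arc name encoding both the source and target table/column.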
SPARQL. In a nutshell, a SPARQL query is a set of triple patterns where some elements are replaced with variables. The execution happens by searching the RDF graph for subgraphs that match the patterns. For example, applying a query with a single triple pattern over the predicate arc:school:State to the RDF graph of Figure 2 searches for pairs of nodes that are connected with arcs of type arc:school:State. Entries starting with the symbol ? represent variables. See Figure 1 for an example of a more complicated query. SPARQL also supports subqueries and aggregators, the GROUP, SORT, UNION, MINUS keywords, etc. See, e.g., the Wikidata SPARQL tutorial for a detailed overview of SPARQL features.
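The triple-pattern semantics can be illustrated with a tiny matcher over the tuple representation from above; names prefixed with "?" act as variables. This mimics the matching behavior described in the text, not any real SPARQL engine:

```python
# Minimal illustration of SPARQL-style triple-pattern matching: a pattern
# element starting with "?" is a variable that binds to the corresponding
# triple element; all other elements must match exactly.

def match_pattern(triples, pattern):
    """Return a list of variable bindings for every triple matching the pattern."""
    results = []
    for triple in triples:
        binding = {}
        ok = True
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                if binding.get(pat, val) != val:  # repeated variable must agree
                    ok = False
                    break
                binding[pat] = val
            elif pat != val:
                ok = False
                break
        if ok:
            results.append(binding)
    return results

graph = {
    ("school:1", "arc:school:State", "MI"),
    ("school:2", "arc:school:State", "OH"),
    ("teacher:1", "arc:teacher:Name", "Smith"),
}
# Pairs of nodes connected with arcs of type arc:school:State:
pairs = match_pattern(graph, ("?x", "arc:school:State", "?y"))
```

A full engine additionally joins the bindings of multiple patterns, which is how more complicated queries like the one in Figure 1 are evaluated.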
Translating grounded QDMR to SPARQL. We implemented a translator from a grounded QDMR LF into SPARQL. Note that LFs do not have a formal specification defining their execution, so our translator fills in the formal meaning. Our translator recursively constructs graph patterns that contain the results of LF steps. When processing a step, the method first constructs one or several patterns for the step arguments and then connects them into another pattern. At the beginning of the process, we request the method to construct the pattern containing the last QDMR step, which corresponds to the query output. We provide the details of our translator in Appendix A.

Text-to-QDMR parser
In this section, we describe our approach to generating a grounded QDMR LF from a given question and database schema. Our encoder consists of BERT-like pretrained embeddings (Devlin et al., 2019; Yu et al., 2021) and a relation-aware transformer (RAT). Our decoder is an LSTM-based model that generates an abstract syntax tree in the depth-first traversal order (Yin and Neubig, 2017).

Encoder
In our task, the input is a sequence of question tokens and a set of database entities eligible for grounding: tables, columns, and extracted values.
To choose values from a database, we use string matching between question tokens and database values (see Appendix B). Additionally, we extract numbers and dates from the question that can be valid comparative values not present in the database. To avoid ambiguity of the encoding, we combine multiple identical values from different columns into one. The question tokens and grounding candidates are embedded with a BERT-like pretrained model, e.g., GraPPa (Yu et al., 2021). The obtained representations are fed into the relation-aware transformer, RAT.
RAT module. RAT is based on the relation-aware self-attention layer (Shaw et al., 2018). Unlike the standard self-attention in the transformer model (Vaswani et al., 2017), this layer explicitly adds embeddings r_ij that encode relations between two inputs x_i, x_j: the relation embeddings are added to the keys (and, analogously, to the values), so the attention logits become e_ij = x_i W_Q (x_j W_K + r_ij)^T / sqrt(d). The relations between the columns and tables come from the schema structure, e.g., the table-primary-key and foreign-key relations. We also have relations based on matches: question-table and question-column matches based on the n-gram comparison (Guo et al., 2019) and question-value matches from our value-extraction procedure.
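Relation-aware self-attention can be sketched as follows. This is a minimal NumPy sketch of the Shaw et al. (2018) mechanism under the common formulation (relation embeddings added to keys and values); the dimensions and weight shapes are illustrative, not taken from the paper:

```python
import numpy as np

def rat_attention(x, r_k, r_v, wq, wk, wv):
    """One relation-aware self-attention head.

    x: (n, d) input representations; r_k, r_v: (n, n, d) pairwise relation
    embeddings added to keys and values; wq, wk, wv: (d, d) projections.
    """
    q = x @ wq                                 # queries (n, d)
    k = x @ wk                                 # keys    (n, d)
    v = x @ wv                                 # values  (n, d)
    d = x.shape[1]
    # e_ij = q_i . (k_j + r_ij^K) / sqrt(d): relations bias the attention logits
    e = np.einsum("id,ijd->ij", q, k[None, :, :] + r_k) / np.sqrt(d)
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)       # softmax over j
    # z_i = sum_j a_ij (v_j + r_ij^V)
    z = np.einsum("ij,ijd->id", a, v[None, :, :] + r_v)
    return z

n, d = 3, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d))
w = [rng.normal(size=(d, d)) for _ in range(3)]
# With zero relation embeddings this reduces to standard self-attention.
z = rat_attention(x, np.zeros((n, n, d)), np.zeros((n, n, d)), *w)
```

In the model, r_ij is looked up from a learned table indexed by the relation type between elements i and j (schema edge, n-gram match, value match, or a default relation).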

Decoder
The decoder is a recurrent model with LSTM cells that generates an abstract syntax tree (AST) in the depth-first traversal order (Yin and Neubig, 2017). At each prediction, the decoder selects one of the allowed outputs; the list of allowed outputs is defined by our QDMR grammar (see Appendix C). The output can be a grammar rule (transition to a new node in the AST), a grounding choice, or a previous step number (leaf nodes in the AST).
To predict grammar rules, we use the same modules as in the RAT-SQL model. The decoder predicts comparators, aggregators, and sort directions using the output of an MLP. For table, column, or value grounding, we use the pointer-network attention mechanism (Vinyals et al., 2015). To predict a reference to a previous QDMR step, we use an MLP with a mask in the output softmax. To avoid incorrect QDMR output, we use several restrictions in the decoding process. Most of them concern the prediction of comparative arguments, e.g., we check type consistency (see Appendix D).
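Restricting the decoder to allowed outputs is commonly implemented by masking disallowed logits before the softmax; the sketch below illustrates this general technique (the logit values and the allowed set are hypothetical, not the paper's grammar):

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over logits with disallowed entries forced to probability 0."""
    # outputs the grammar forbids in the current state get -inf logits
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    m = max(masked)
    exps = [math.exp(l - m) if l != -math.inf else 0.0 for l in masked]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
allowed = [True, False, True, False]   # say the grammar permits outputs 0 and 2 only
probs = masked_softmax(logits, allowed)
# probs[1] and probs[3] are exactly 0, so the decoder can never emit them
```

The same masking idea covers both the reference-to-previous-step prediction and the type-consistency checks: any candidate failing a check is simply removed from the allowed set.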

Training
We follow the RAT-SQL training procedure in the main aspects. We use the standard teacher-forcing approach for autoregressive models. We found that the additional alignment loss proposed for RAT-SQL did not lead to any improvements in our case, so we trained the models with the cross-entropy loss with label smoothing. See Appendix B for implementation details.
Augmentations. We randomly permute tables, columns, and values during training. We experimented with a random choice of QDMR graph linearization at training time but did not observe performance improvements. We also tried to randomly select one of the multiple available QDMR groundings, but it did not help either.

Experiment setup

Data
For training and evaluation, we use the part of the Break dataset that corresponds to the Spider dataset. The data includes questions and databases from Spider, QDMR logical forms from Break, and groundings that we collected. Automatic grounding annotation is challenging, but we are able to annotate more than 60% of the Break data with target groundings (see Section 4.2). Our splits are based on the Break splits but take into account the grounding annotation. The Break dataset does not include the Spider test set, as it is hidden; the Break dev and test sets are the two halves of the Spider dev set. The gold QDMR and grounding annotation on the Break test is also hidden. The overall dataset statistics are shown in Table 3.

(The Break dataset also contains QDMRs for other text-to-SQL datasets, e.g., single-database ATIS and GeoQuery. Comparison in the regime of fine-tuning on a specific database is also interesting, but both the baselines' codebases and ours failed due to the limitations of the SQL parsers coming from Spider. This issue might be resolved by switching to a different SQL parser, but that appeared technically infeasible at the time of writing.)

We fixed typos and annotation errors in some train and dev examples. We also corrected some databases on train and dev: we deleted trailing white spaces in values (these led to mismatches between SQL queries and the database) and added missing foreign keys (necessary for our SPARQL generator) based on the procedure of Lin et al. (2020). We kept the test questions and SQL queries unchanged from the original Spider dataset, which implies that some dataset errors can degrade comparisons of SQL and SPARQL results.

Annotating Groundings for LFs
We process LFs from the Break dataset in several stages. At the first stage, we iterate over all the operators and make their arguments compatible with our specification (see Table 2).
At the second stage, we collect candidate groundings for each argument that requires grounding. At this stage, we use all available sources of information: text-based similarity between the text argument and the names of the database entities, the corresponding SQL query from Spider, and the explicit linking between the question tokens and the elements of the schema released by Lei et al. (2020). Importantly, we can match the output of the LF to the output of the SQL query and propagate groundings inside the LF, which allows us to obtain many high-quality groundings.

At the third stage, we use the collected candidate groundings and group them in all possible ways to obtain candidate LFs with all arguments grounded. Then, for each candidate LF, we run our QDMR-to-SPARQL translator and execute the obtained query. We accept the candidate if there are no failures in the pipeline and the result of the SPARQL query equals the result of the SQL one. Finally, we include a question in the dataset if we have accepted at least one grounded LF for it. Note that we can accept several versions of grounding for each question. We cannot figure out which one is better at this point, so we can either pick one randomly or use all of them at training.
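The third stage can be sketched as a search over the Cartesian product of candidate groundings, accepting a combination when execution succeeds and matches the gold result. Here execute_lf, the candidate sets, and the gold result are stand-ins for the real translator, database, and SQL output:

```python
from itertools import product

def accept_groundings(candidates, execute_lf, gold_result):
    """candidates: {arg_name: [possible groundings]}. Returns accepted grounded LFs."""
    accepted = []
    names = sorted(candidates)
    for combo in product(*(candidates[n] for n in names)):
        lf = dict(zip(names, combo))
        try:
            result = execute_lf(lf)       # translation/execution may fail
        except Exception:
            continue                      # reject candidates that break the pipeline
        if result == gold_result:         # keep only result-matching groundings
            accepted.append(lf)
    return accepted

# Toy example: two candidate groundings for arg1, one for arg2.
candidates = {"arg1": ["school.State", "school.Name"], "arg2": ["teacher.Age"]}
gold = [("MI", 2)]
fake_execute = lambda lf: [("MI", 2)] if lf["arg1"] == "school.State" else []
accepted = accept_groundings(candidates, fake_execute, gold)
# only the combination grounding arg1 to school.State survives
```

In practice the product can be large, so pruning by candidate quality before enumeration is useful; the sketch ignores this.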

Evaluation Metric
For evaluation on the Spider dataset, most text-to-SQL methods use the metric called exact set matching without values. This metric compares only some parts of SQL queries, e.g., values in conditions are not used, and sometimes incomplete non-executable queries can achieve high metric values. As our approach does not produce any SQL query at all, this metric is not applicable.
Instead, we use an execution-based metric, a.k.a. execution accuracy. This metric compares the results produced by the execution of queries (allowing arbitrary permutations of the output columns). Recently, the Spider leaderboard started supporting this metric, but submitting directly to the leaderboard is still not possible for us because the exposed interface requires SQL queries. We modify the Spider execution accuracy evaluation in such a way that it can support any query language that can be executed to produce results. When comparing the results of SPARQL to the results of SQL, we faced several challenges:
• the order of output columns in SQL does not match the order in the question;
• in Spider, when selecting relations w.r.t. argmin or argmax, there is no consistent policy on whether to pick all the rows satisfying the constraints or only one of them;
• the order of rows in the output of SQL is stable, but the order of rows in the output of SPARQL varies depending on minor launching conditions;
• in SPARQL, sorting is unstable (it can arbitrarily reorder elements with equal sorting-key values), but SQL sorting is stable.
The first two points can make SQL-to-SQL comparisons invalid as well; the others affect only SQL-to-SPARQL comparisons.
To resolve these issues, we implemented a metric supporting SQL-to-SQL, SQL-to-SPARQL, and SPARQL-to-SPARQL comparisons with the following properties:
• we reorder the columns of the outputs based on the columns the output values come from; if the matching fails, we try to compare the output tables with the given order of columns;
• if one of the outputs is from an SQL query ending with "ORDER BY ··· LIMIT 1", we check that the produced row is contained in the other output;
• if one of the outputs has undergone unstable sorting, we allow it to provide the key w.r.t. which the sorting was done and try to match the order of the rows in the other output by swapping rows with identical sorting-key values;
• before comparison, we extract the column types from both outputs and convert each value to a standardized representation.
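The core idea of a permutation-tolerant comparison can be sketched as follows. This simplified check ignores row order and the sorted-output handling described above; it only shows the column-permutation part:

```python
from itertools import permutations

def tables_match(a, b):
    """a, b: lists of row tuples. True if the columns of b can be permuted
    so that the multisets of rows of a and b coincide."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    n_cols = len(a[0])
    rows_a = sorted(a)                    # compare rows as multisets
    for perm in permutations(range(n_cols)):
        permuted = sorted(tuple(row[i] for i in perm) for row in b)
        if permuted == rows_a:
            return True
    return False

sql_out = [("MI", 5), ("OH", 3)]
sparql_out = [(3, "OH"), (5, "MI")]      # columns swapped, rows reordered
# tables_match(sql_out, sparql_out) accepts this pair
```

The real metric avoids the factorial search by matching columns via the source columns their values come from, falling back to positional comparison only when that matching fails.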

Comparison with text-to-SQL methods
First, we compare our approach to state-of-the-art text-to-SQL methods (that generate full executable queries) BRIDGE (Lin et al., 2020) and SmBoP (Rubin and Berant, 2021), both from the top of the Spider leaderboard. See Table 4 for the results. As our training data includes only 50% of the original Spider train, we add to the comparison BRIDGE and SmBoP models trained on the same data subset. We use the official implementations of both models.
All models are trained together with finetuning pretrained contextualized representations: BRIDGE encoder uses BERT, SmBoP encoder uses GraPPa, our model has both BERT and GraPPa versions.
We choose the final model of each training run of our system based on the best dev result from the last 10 checkpoints with the step of 1000 iterations. For BRIDGE and SmBoP, we used the procedures provided in the official implementations (they similarly look at the same dev set). The estimated std of our model is 0.9 on the dev set (estimated via retraining our BERT-based model with 5 different random seeds).
On the development set, our models achieve better execution accuracy than the text-to-SQL parsers even when those are trained on the full Spider data. On the test set, our models outperform BRIDGE but not SmBoP when trained on the same amount of data. See Table 6 for qualitative results of our GraPPa-based model.
We did not include the results of RAT-SQL in Table 4, because this model was trained to optimize exact set matching without values, so the model output contains placeholders instead of values. The model trained on full Spider reproduces the published exact matching scores but gives only 40.2% execution accuracy on dev and 39.9% on test. Correct predictions mostly came from correct SQL queries without values. We also tried the available feature of value prediction in the official implementation of RAT-SQL and obtained better execution accuracy scores (48.5% on dev and 46.4% on test), but they were still very low.

Additional training data from Break
The Break dataset contains QDMR annotations for several question answering datasets, so we tried to enrich training on Spider with QDMRs from other parts of Break. Table 5 shows the execution accuracy on our dev and test in these settings. Adding training data for both versions of the model leads to performance improvement on the test set, but slightly decreases the dev set results.
When training with the data from other parts of Break, we simply assume that the schema is empty and use all the textual QDMR arguments as values. More careful exploration of additional QDMR data is left for future work.

Ablation study
[...] in both models decreases the execution accuracy. Next, we tested different configurations of the RAT-encoder:
• without the relations that come from the schema structure (e.g., the table-primary-key and foreign-key relations);
• with a small number of default relations: without distinguishing tables, columns, or values, because these elements are considered as elements of one unified grounding type;
• with the regular transformer instead of RAT.
The model without schema relations lost 11% on dev, which shows that encoding the schema with the RAT-encoder is an important part of the model. This also limits the use of additional data from Break, where schemas do not exist. The variety of relations in the RAT-encoder is also important, as is RAT itself. Our findings are consistent with the ablations reported for RAT-SQL.

Related Work
Text-to-SQL. The community has recently made significant progress and moved from fixed-schema datasets like ATIS or GeoQuery (Popescu et al., 2003; Iyer et al., 2017) to the WikiSQL and Overnight datasets with multiple single-table schemas (Wang et al., 2015; Zhong et al., 2017), and then to the Spider dataset with multiple multi-table multi-domain schemas (Yu et al., 2018). Since the release of Spider, accuracy has moved up from around 10% to 70%.
Most recent systems are structured as encoder-decoder networks. Encoders typically consist of a module fine-tuned from a pretrained language model like BERT (Devlin et al., 2019) and a module for incorporating the schema structure. Guo et al. (2019), Zhong et al. (2020), and Lin et al. (2020) represented schemas as token sequences, Bogin et al. (2019a,b) used graph neural networks, and RAT-SQL (Wang et al., 2020) used the relation-aware transformer, RAT, to encode a graph constructed from an input schema. In this paper, we use the RAT module to encode the schema but enlarge the encoded graph by adding value candidates as nodes.
Decoders are typically based on a grammar representing a subset of SQL and produce output tokens in the depth-first traversal order of an abstract syntax tree (AST), following Yin and Neubig (2017). A popular choice for such a grammar is to use SemQL of Guo et al. (2019) or a lighter grammar with more intensive consistency checks inside beam search, as in BRIDGE (Lin et al., 2020). Recently, Rubin and Berant (2021) proposed a different approach to decoding based on bottom-up generation of sub-trees on top of the relational algebra of SQL. In our paper, we follow the standard AST-based approach but with a grammar describing grounded QDMRs. We also use some consistency checks at decoding time to prevent easily avoidable inconsistencies.
There is also a line of work on weakly-supervised learning of text-to-SQL semantic parsers, where SQL queries or logical forms for the training set are not available at all. Some works (Min et al., 2019; Wang et al., 2019; Agarwal et al., 2019; Liang et al., 2018) reported results on the WikiSQL dataset; others worked on the GeoQuery and Overnight datasets. We are not aware of any works reporting weakly-supervised results on the multi-table Spider dataset.
Pretraining on text and tables. One possible direction, inspired by the success of pretraining language models on large text corpora, is to pretrain models on data with semantically connected text and tables. TaBERT (Yin et al., 2020) is one such model; GAP (2021) used synthetic data generated by models for SQL-to-text and table-to-text auxiliary tasks. In this paper, we do not pretrain such models but experiment with GraPPa as the input encoder.
QDMR. Together with the Break dataset, Wolfson et al. (2020) created a task of predicting QDMRs given questions in English. As a baseline, they created a seq2seq model enhanced with a copy mechanism of Gu et al. (2016). Recently, Hasson and Berant (2021) built a QDMR parser that is based on dependency graphs and uses RAT modules. Differently from this line of work, we use a modified version of QDMRs, and our models never actually predict QDMR arguments as text but always directly their groundings.
SPARQL. SPARQL was used in several lines of work on semantic parsing for querying knowledge bases. The SEMPRE system of Berant et al. (2013) relied on SPARQL to execute logical forms on the Freebase knowledge base. Yih et al. (2016) and Talmor and Berant (2018) created the WebQuestions and ComplexWebQuestions datasets, respectively, where annotations were provided in the form of SPARQL queries. A series of challenges on Question Answering over Linked Data (QALD; Lopez et al., 2013) and the LC-QuAD datasets (Trivedi et al., 2017; Dubey et al., 2019) targeted the generation of SPARQL queries directly. Our paper is different from these lines of work, as we rely on supervision via QDMRs and not SPARQL directly.
There also exist several lines of work on converting queries from/to SPARQL, and these problems are difficult. See, e.g., the works of Michel et al.

Conclusion
In this paper, we proposed a way to use the recent QDMR format (Wolfson et al., 2020) as annotation for generating executable database queries given a question in a natural language. Using QDMRs is beneficial because they can potentially be collected through crowdsourcing more easily than correct database queries. Our system consists of two main parts. First, we have a learned text-to-QDMR translator that we built on top of the recent RAT-SQL system and trained on the part of the Spider dataset annotated with QDMRs. Second, we have a non-trainable QDMR-to-SPARQL translator, which generates queries executable on databases in the RDF format. We evaluated our system on the Spider dataset and showed it to perform on par with modern text-to-SQL methods (BRIDGE and SmBoP) trained with full supervision in the form of SQL queries. We also showed that additional QDMR annotations for questions not aligned with any databases could further improve the performance, which indicates great potential for future work.

Supplementary Material (Appendix) SPARQLing Database Queries from Intermediate Question Decompositions
A QDMR-to-SPARQL translator

Table 8 contains the full list of QDMR operators used in our paper. Algorithm 1 sketches the QDMR-to-SPARQL translator. It is a recursive procedure that creates SPARQL queries for all QDMR LF steps. At its core, it constructs one or several patterns for the step arguments and then connects them into another pattern in a way specific to the LF operator of the current step.
Importantly, the patterns for LF operators can be of two types: inner (inline) and full. An inner pattern represents the internal part of a query that needs to be placed inside the curly brackets {...}. A full pattern corresponds to a full query that can be executed directly (starts with the SELECT keyword). An inner pattern can be converted to full by using the SELECT <output vars> WHERE {<inner>} construction. The full pattern can be converted to inner by creating a subquery via {<full>} (here, the output variables of <full> pattern become available in the scope where the subquery is created).
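The two conversions between inner and full patterns can be illustrated with plain string templates (an illustrative sketch, not the actual implementation; the variable names are toy examples):

```python
def inner_to_full(inner, output_vars):
    """Wrap an inner pattern into a directly executable query via
    SELECT <output vars> WHERE {<inner>}."""
    return f"SELECT {' '.join(output_vars)} WHERE {{ {inner} }}"

def full_to_inner(full):
    """Turn a full query into a subquery pattern via {<full>}; the output
    variables of the full query become available in the enclosing scope."""
    return f"{{ {full} }}"

inner = "?x arc:school:State ?y"
full = inner_to_full(inner, ["?x", "?y"])
# full == "SELECT ?x ?y WHERE { ?x arc:school:State ?y }"
sub = full_to_inner(full)
```

Because both conversions exist, the translator is free to request whichever pattern type each LF operator needs from its arguments.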
Different LF operators require and produce different patterns: inner or full. Next, we specify a pattern for each LF operator.
The SELECT operator adds the grounded object to the context: a self link for a table, a link for a column, a link with a filtering condition for a value.
The PROJECT operator creates a context for the argument and does the same as SELECT. To connect instances from different columns, we use breadth-first search to find the shortest path in the undirected graph in which all the columns of all tables are nodes, and edges appear between the primary key of each table and all other columns of the same table, as well as along the foreign-key links.
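The breadth-first search over the column graph can be sketched as follows; the toy schema (two tables linked by a foreign key) is illustrative, not from the paper:

```python
from collections import deque

def shortest_join_path(edges, src, tgt):
    """edges: iterable of undirected (column, column) pairs. Returns the
    list of columns on a shortest path from src to tgt, or None."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == tgt:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

edges = [
    ("teacher.id", "teacher.Name"),       # primary key to column, same table
    ("teacher.id", "teacher.school_id"),
    ("teacher.school_id", "school.id"),   # foreign-key link between tables
    ("school.id", "school.State"),
]
path = shortest_join_path(edges, "teacher.Name", "school.State")
```

Each edge on the returned path corresponds to one triple pattern in the generated SPARQL query, so the shortest path yields the smallest join.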
The COMPARATIVE operator first creates an inner <pattern> for its arguments and then adds a filtering condition from the l.h.s. values <filter_var>, the operation <comparator>, and the r.h.s. value <value>: <pattern> FILTER(<filter_var><comparator><value>).
The AGGREGATE operator computes the aggregator <agg_op> over a set of values. This operator takes the inner pattern <pattern> as input (with <var> corresponding to the set of values to aggregate) and produces the full query with the output variable <output_var> as the output:

SELECT (<agg_op>(<var>) as <output_var>) WHERE { <pattern> }
The SUPERLATIVE operator filters the instances such that some related attribute has the min/max value. The operator first computes the min/max value with a built-in AGGREGATE operator and then filters (similarly to COMPARATIVE) the patterns based on the computed value. The SUPERLATIVE operator requires two inner patterns as input, <pattern_inner> and <pattern_outer>, and produces an inner pattern as the output.
The GROUP operator groups the values <var> by the equal values of the related attribute <index_var>:

SELECT <index_var> (<agg_op>(<var>) as <output_var>) WHERE { <pattern> } GROUP BY <index_var>

The aggregation is done with the operator <agg_op>. The input pattern <pattern> is inner, and the output is the full pattern with the output variable <output_var>.
The UNION operator can actually correspond to several operations: horizontal union, vertical union, union of aggregators, and union after group. By horizontal union, we mean the union of two or more related variables from the same pattern. These variables have to correspond to different database columns. By vertical union, we mean the union of two or more variables corresponding to the same column but coming from different patterns. This case is implemented with the UNION keyword from SPARQL using the following construction:

{ <pattern_1> } UNION { <pattern_2> }

The union-after-group case is a special but common situation when the arguments contain the result of the GROUP operator and the index variable of the same operator. We implement this case similarly to the pattern of the GROUP operator but with several variables in the output. The union of aggregators is another common special case when the arguments of the UNION contain several aggregators from the same pattern. We simply output these aggregators by concatenating them after the SPARQL SELECT keyword.
The INTERSECT operator effectively consists in sequentially applying two COMPARATIVE operators that do not have explicit comparisons as arguments.
The DISCARD operator is based on a pattern very similar to the vertical union but with the MINUS keyword in place of UNION. The SORT operator consists in adding the ORDER BY keyword (with the sorting variable and direction) at the end of the full pattern.

B Implementation details
We implemented our model on top of the RAT-SQL code built with PyTorch (Paszke et al., 2019). We use pretrained BERT and GraPPa from the Transformers library (Wolf et al., 2020). To support SPARQL queries and RDF databases, we used two libraries: RDFLib and the open-source version of the Virtuoso system. RDFLib was much easier to install (a Python package), but Virtuoso allowed us to run SPARQL queries on pre-loaded databases much faster.
To choose relevant values from a database, we tokenized the question and all unique database values using the Stanford CoreNLP library (Manning et al., 2014), filtered tokens using the NLTK English stopwords, and then picked the top-25 values with the highest similarity scores, calculated as follows:
• for a numeric value, we gave the maximum score if it exactly matched some question token; otherwise, we gave the minimum score;
• for other tokens, we gave the maximum score if the value and question stems were the same (we used the Porter and Snowball stemmers from NLTK); otherwise, we calculated a similarity score based on the longest continuous matching subsequence (we used the Python SequenceMatcher class).
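The scoring scheme above can be sketched with difflib's SequenceMatcher; this simplified version skips stemming and stopword filtering, and the ratio-based score is only an approximation of the longest-continuous-match criterion:

```python
from difflib import SequenceMatcher

def value_score(question_token, db_value):
    """Max score for an exact match, otherwise a contiguous-subsequence
    similarity from SequenceMatcher."""
    q, v = question_token.lower(), str(db_value).lower()
    if q == v:
        return 1.0
    return SequenceMatcher(None, q, v).ratio()

def top_values(question_tokens, db_values, k=25):
    """Score every database value against the question and keep the top k."""
    scored = [(max(value_score(t, v) for t in question_tokens), v)
              for v in db_values]
    scored.sort(key=lambda s: -s[0])
    return [v for _, v in scored[:k]]

tokens = ["teachers", "from", "michigan"]
values = ["Michigan", "Ohio", "Smith"]
best = top_values(tokens, values, k=1)
# "Michigan" matches a question token exactly (case-insensitively)
```

The real procedure applies Porter/Snowball stemming before the comparison, so morphological variants like "teachers"/"teacher" also reach the maximum score.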
For the neural network architecture and training, we used the same hyperparameters as RAT-SQL: 8 RAT layers, each with 8 heads, and hidden dimensions of 256, 1024, and 512 in the self-attention, the position-wise feed-forward network, and the decoder LSTM, respectively. We trained the models with the Adam optimizer (Kingma and Ba, 2014) and a polynomial decay scheduler. The batch size was 24, and the overall number of iterations was 81000 for all models.
The training time on 4 NVIDIA V100 GPUs was approximately 24 hours.