PAUQ: Text-to-SQL in Russian



Introduction
Semantic parsing (SP) is the task of transforming a natural language (NL) utterance into a formal, machine-understandable representation. Such representations take a variety of forms, from parses of linguistic features to code in a specific programming language. In this paper, we focus on one SP subtask: mapping NL questions to Structured Query Language (text-to-SQL).
Machine learning systems based on supervised learning and aimed at solving text-to-SQL continue to appear and evolve actively (Kim and Lee, 2021; Cai et al., 2021; Gan et al., 2022).
The key ingredient in the training process is the data: a parallel corpus of NL questions and corresponding SQL queries, along with the databases. The majority of text-to-SQL datasets consist of questions and database content written in English. As a consequence, the development of such models has been limited to the English language. The Spider dataset (Yu et al., 2018) remains one of the most popular benchmarks in the field due to the variety of its domains and the complexity of its questions.
This paper presents the Russian version of Spider: PAUQ, the first text-to-SQL dataset in Russian. In PAUQ, all three components have been modified and localized: the NL questions, the SQL queries, and the content of the databases. During this in-depth work, we discovered several limitations of the original Spider and propose ways to overcome them. We present a new version of the Spider database collection. Apart from being bilingual, it is more complete, which makes it possible to use not only exact match as the evaluation metric but also execution accuracy, avoiding false-positive results. We also complement the dataset with new samples of underrepresented types, including questions regarding columns with binary values, columns containing date and time values, and questions that match the database content only fuzzily or partially.
We adapt and evaluate on PAUQ two strong ML models of different architectural types: RAT-SQL (grammar-based) and BRIDGE (sequence-to-sequence). We compare the performance of these models in terms of their understanding of SQL components and schema linking. Our evaluation demonstrates that both models achieve strong results with monolingual training and improved accuracy in a multilingual setup.
Another important result of this work is the development of functional test sets: subsets of samples in English and Russian with particular features. Other researchers can use them to evaluate model performance on particular types of questions, and thus in a more precise and informative way. Our experiments with machine-translated data suggest that, in terms of resources expended and model performance, it can be beneficial to use machine translation for easy cases (such as questions not containing values) and manual translation for questions that refer to database values. We regard this work as an important step for Russian text-to-SQL. Furthermore, we hope that our contributions will help other researchers adapt Spider to other languages.

Related Work
Text-to-SQL has been actively studied in recent years, with a number of datasets and benchmarks (Hemphill et al., 1990; Zelle and Mooney, 1996; Popescu et al., 2003; Li and Jagadish, 2014; Yaghmazadeh et al., 2017; Zhong et al., 2017). The goal of the Spider benchmark is to develop NL interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries over 200 multi-table databases covering 138 different domains. In Spider 1.0, the train and test sets contain different complex SQL queries and different databases. To do well on it, systems must generalize not only to new SQL queries but also to new database schemas.
There have been several attempts to adapt the Spider dataset to other languages. The first is Chinese Spider (CSpider) (Min et al., 2019), in which the original Spider questions were translated from English into Chinese. In this case, the translation is not literal: some values underwent a specific cultural localization. The principles of this localization, however, were not explicitly stated. The authors also compare results obtained by models trained on machine-translated questions with results of models trained on NL questions translated manually and subjected to localization. The study shows that models trained on human translations significantly outperform machine-translation-based ones. This highlights the idea that adapting English datasets to other languages requires professional translation and localization. However, the database content in CSpider was not complemented with the new localized values, so a complete evaluation of the obtained results is impossible.
Two other language-specific versions of the Spider dataset are Vietnamese (Nguyen et al., 2020) and Portuguese (José and Cozman, 2021). In the Vietnamese case, not only the NL questions were translated, but also the database schema, including table and column names, along with the values in SQL queries. Unlike in Chinese Spider, the translated values were not localized; nor were they added to the database content. In line with the experimental setup of the CSpider work, a comparison between manually translated and machine-translated versions of the Vietnamese dataset was conducted. In Portuguese Spider, only the NL questions were translated, excluding database values, so no modifications affecting the database schema or content were made. As can be seen, although the existing versions of the Spider dataset differ in the principles by which the language adaptation was carried out, none of them modifies all components of the original dataset: the database content is left unchanged and is not aligned with the questions. This also means that execution accuracy cannot be used to measure model performance, so only exact match is reported.

Dataset
When managing the translation of the original Spider dataset into Russian, we were guided by the similar experience of the translation into Chinese (Min et al., 2019). After a careful review of Chinese Spider (CSpider), we identified two aspects that were not reflected in it and which we wanted to take into account: the formulation of principles according to which an original English query has to be modified during translation in order to comply with the Russian language and culture, and the insertion of the modified values into the database content so that models can be evaluated correctly with execution accuracy. As the database content is enriched with Russian values, we must attend to another significant issue: the unambiguity of the questions. This principle also means that Russian values should not be direct translations of English values, since in that case it is not obvious what results we would expect: should SQL queries be formulated so that the English values are taken into consideration, the Russian values, or perhaps a combination of both? For real-world databases in Russia, just as in Brazil according to (José and Cozman, 2021), it is common to have English table and column names even though the content is in Russian. For this reason, all table and column names in SQL queries and database schemas remain in English.
The main principle of translation is the following: if an SQL query contains only table/column names and no values, the query and the corresponding text question are not subjected to localization or modification; they are simply translated. If an SQL query contains values, the query as well as the corresponding text question are changed on the basis of several criteria, foremost among which is this: Russian values should not be literal translations of the English ones, but rather close analogues that are, at the same time, absent from the original database content. The principles of value localization can be found in Appendix A.
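To make the rule concrete, below is a minimal sketch; the sample pair, including the singer table and the name "Иван Петров", is hypothetical and not taken verbatim from PAUQ.

original = {
    "question": "What is the id of the singer named John Smith?",
    "sql": "SELECT id FROM singer WHERE name = 'John Smith'",
}
localized = {
    # The Russian value is a close cultural analogue, not a literal
    # translation, and is assumed to be absent from the English values
    # already stored in the column.
    "question": "Какой id у певца по имени Иван Петров?",
    "sql": "SELECT id FROM singer WHERE name = 'Иван Петров'",
}
# A query without values, by contrast, keeps its SQL untouched:
no_values = "SELECT COUNT(*) FROM singer"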
The translation from English to Russian was undertaken by a professional human translator. The original Spider dataset consists of 10,181 samples; however, the test set is closed, which means that only 9,691 samples are available for processing. Preliminary work includes dividing the questions into those that contain values, and thus need modification, and those that do not. For questions with values, lists of all the corresponding columns' values from the database are obtained in order to ease the selection of Russian values that have no direct English analogues in the database. The translator is informed of the SQL query structure in advance and is provided with instructions stating all the basic principles of the translation. The translator is encouraged to translate each question as closely to the original text as possible and to rephrase it as necessary to obtain natural Russian text. We also strive for a variety of questions, which is why no textual patterns are used.
As regards updating the database content, we added all localized/modified values to the corresponding fields. Without this, fair evaluation of models with execution accuracy would be impossible: for questions containing absent values, the ground-truth and predicted requests, incorrect as well as correct ones, would return the same result when executed. In the case of binary values, we delete the English analogues in order to avoid ambiguity and keep the column's values binary.
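A hedged sqlite3 sketch of this update step is shown below; the schema is a toy stand-in, not an actual Spider/PAUQ database.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE singer (name TEXT, country TEXT)")
cur.execute("CREATE TABLE flights (id INTEGER, completed TEXT)")
cur.execute("INSERT INTO flights VALUES (1, 'No')")
# Insert the localized value referenced by the Russian question, so that
# the gold query returns a non-empty result when executed:
cur.execute("INSERT INTO singer (name, country) VALUES (?, ?)",
            ("Иван Петров", "Россия"))
# For binary columns, replace the English analogue so that the column
# keeps exactly two (Russian) values:
cur.execute("UPDATE flights SET completed = 'Нет' WHERE completed = 'No'")
conn.commit()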
Verification of the translated questions and their conformity with the queries, and the updating of the databases, were undertaken by 4 computer science students. First, all databases with the corresponding Russian questions and SQL queries were divided between the annotators. Each student checks the correctness of a question and a query; if the values in the query have been changed (a special mark indicates this), the annotator adds the new values to the corresponding fields of the database table(s), so that when the database is accessed by the query, the added values are retrieved. After that, the same question and query are cross-checked by another annotator, who also makes sure that the database request works as expected.
We also revised the original Spider dataset: for those SQL queries that return None or an empty list, corresponding values were added to the original databases; databases that were empty (geo, scholar, yelp, academic, imdb, restaurants) were filled with values. Further information on these changes can be found in subsection 4.1.
Following previous research on building Spider datasets for other languages, we also constructed a machine-translated (MT) dataset using the Yandex Translate API (Yandex, 2022). The experiments with the Chinese and Vietnamese Spider versions show that the performance of models trained on such data drops compared to that of models trained on a human-translated dataset. The quality of the MT dataset certainly depends on the quality of the machine translation model itself; in addition, there is an even more problematic issue: the translation of values. Without human revision and database updating, values in NL questions will not match the database content in most cases. In order to examine the trade-off between the performance of models trained on the data and the resources spent to obtain this data, we also experiment with a combined dataset, in which questions without values are machine-translated, whereas manual translation is used for questions that include values, to avoid inconsistency (see subsection 5.2).

Comparative Analysis
Besides manually translating the questions from the original Spider, we (i) add new samples, (ii) insert new values into the tables, and (iii) compose a suite of functional test sets. In order to assess whether the basic properties of Spider are preserved and whether its limitations are mitigated, we make a detailed analysis of the datasets and present it in Appendix B. Here we outline the motivation for the updates, highlight the main differences, and define an approach for estimating key dataset features such as complexity, balance, and diversity.

Component Analysis
A text-to-SQL dataset consists of three components: (i) the content of the databases, (ii) the NL questions, and (iii) the SQL queries. We divide our comparative analysis of Spider and PAUQ into three corresponding parts.

Databases Contents
The number of values in PAUQ increased by 2% compared to Spider. The main reasons are the requirement that there be no ambiguity between Russian and English elements (all new Cyrillic values are unique and have no direct English analogues in their columns) and the desire to reduce the likelihood of false-positive errors. The content of the databases and its relation to the NL questions can negatively affect the evaluation of text-to-SQL models (Kim et al., 2020). One common drawback of the execution metric is that it compares the results of the ground-truth and generated SQL requests and therefore tends to give false positives for accidentally matching returns. The most common case is a null match, when a ground-truth SQL request refers to non-existent values and thus coincides in its (empty) return with many inappropriate SQL requests.
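The toy example below (with a made-up one-row table) reproduces the null match: the gold query refers to a non-existent value, so an unrelated wrong prediction yields the same empty return, and a pure execution check counts it as correct.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
cur.execute("INSERT INTO singer VALUES (1, 'Anna')")
gold = "SELECT id FROM singer WHERE name = 'Missing Value'"
pred = "SELECT id FROM singer WHERE id = -1"  # structurally wrong query
# Both return [], so execution accuracy alone marks pred as correct:
assert cur.execute(gold).fetchall() == cur.execute(pred).fetchall() == []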
We analysed such errors made by the models on the original dataset and made several modifications, which lead to the following changes in PAUQ compared to Spider: (1) all empty tables are filled with values; (2) the number of empty columns decreases by more than a factor of two (from 86 to 32); (3) all mentioned columns have non-zero sizes; (4) the requests are constructed so that a non-empty set of rows in the database corresponds to the set of conditions in a query.
To estimate whether dataset balance and diversity are maintained, we calculate a set of quantitative indicators (Appendix B.1). The key results are: (1) the variety of column sizes and content increases; in PAUQ, the set of column values is less standardized (in Spider, the relatively low diversity results from automatic table generation; see Appendix B.1.2); (2) the Russian values span the same range of token counts; (3) we found that in Spider, values longer than 8 tokens are never mentioned in the requests, whereas PAUQ includes a set of requests to entities with larger token counts (Appendix B.1.3).
Mapping question parts to database entities is often the major and most complex part of text-to-SQL translation systems (Kim et al., 2020; Lei et al., 2020; Wang et al., 2021b). Thus, questions containing tokens that are present in several entities at once are particularly difficult. A challenging case is shown in Fig. 1: two requests mention the token "name", which occurs frequently in the database (both as a column name and as a value in a column), so it is impossible to simply map text tokens to entities. Hence, a model has to process the context of the request.
Thus, the number of intersections between entity names within a database determines the potential complexity of the database set. We increase the proportion of overlapping entities in PAUQ (Appendix B.1.4), making the entity linking task harder for text-to-SQL models.

Questions
The questions translated into Russian are slightly shorter than the English ones due to peculiarities of the Russian language (see Fig. 2). Nevertheless, the qualitative properties of the Spider questions are preserved (Appendix B.2.1).
To enlarge the coverage of database entities used in the questions, we enrich the PAUQ question set with new samples related to 15 tables that are not used in the original Spider questions (Appendix B.3). The number of question patterns is also increased by adding questions containing new request template words, i.e., the words that frame the request: imperative verbs, polite words and expressions, etc. (see Appendix B.4).

Queries
During the translation process we "repaired" more than 10 Spider samples (see Appendix B.5). The problems were mostly connected with ambiguity in the questions and SQL structures.
One natural characteristic of a text-to-SQL dataset is the diversity of its SQL patterns. We found some quantitative imbalances in the set of queries (Appendix B.6). We note that these artefacts also exist in other text-to-SQL datasets such as WikiSQL and ATIS (Finegan-Dollak et al., 2018).

New requests
Our observations show that some categories of requests are underrepresented in Spider. Therefore, we added 213 new samples to PAUQ, divided into five groups, to diversify and supplement the existing suite and to make the dataset more balanced. These groups are the following: (1) requests containing "long" values (more than 4 tokens); (2) samples referring to one of two possible values (opposed to each other) in a column (we refer to such columns as binary columns; querying this type of value also differs semantically from referring to other types); (3) queries containing date or time filters; (4) requests with "fuzzy" mentions of entities (e.g., using synonyms, word reordering, etc.); (5) queries with an empty return (whose conditions, for instance, do not correspond to any row of the database). The last group is kept separate from the others to prevent false-positive errors on it.
The details on the new requests can be found in Appendix C; hedged illustrations of several groups are sketched below.
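The following queries are hypothetical illustrations of groups (1)-(3); none of them are verbatim PAUQ samples, and the table and column names are made up.

# (1) a "long" value of more than 4 tokens:
long_value = "SELECT id FROM papers WHERE title = 'Attention Is All You Need'"
# (2) a binary column with two opposed values, kept in Russian only:
binary_column = "SELECT COUNT(*) FROM flights WHERE completed = 'Нет'"
# (3) a date/time filter:
date_filter = "SELECT name FROM concert WHERE concert_date > '2015-06-01'"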
BRIDGE and RAT-SQL (the text-to-SQL models whose experiments are described in detail in Section 5) achieve 19.2% and 21.1% exact match accuracy, respectively, when trained on the existing English and Russian data and evaluated on the set of new samples. That is 3 times lower than the accuracy obtained on all Spider questions (Table 1). These results demonstrate the complexity of the added examples.

Functional Test Sets
Measuring the performance of models with a single number (e.g., an exact match score) is simple and convenient. Yet this makes it difficult to identify a model's weak points and to estimate its ability to cope with hard cases.
The complexity of requests is defined based on the number of SQL components and is divided into four categories: "Easy", "Medium", "Hard" and "Extra hard" (Yu et al., 2018). But this division is not universal. In particular, (Lei et al., 2020) shows an example that "is actually difficult as it requires a model to perform logic reasoning" but is classified as "medium". The SQL request corresponding to the question "What is the total surface area of the continents Asia and Europe?" has a really simple structure: SELECT SUM(surface_area) FROM country WHERE continent = 'Asia' OR continent = 'Europe'.
However, text-to-SQL models struggle to choose the correct logical connective, because it differs from the connective used in the question ("and" in the text versus OR in the SQL).
One of the most effective approaches to in-depth, detailed analysis is the generation of evaluation suites: multiple (possibly overlapping) test sets, also called challenge sets, that assess specific capabilities of a model. Such suites are now applied across a wide range of NLP tasks: e.g., (Röttger et al., 2020) introduced a suite of functional tests for hate speech detection models, and (Cho et al., 2021) proposed augmented test sets for the dialogue state tracking (DST) task.
To encourage text-to-SQL researchers to investigate the soft spots of their models, we extracted from our dataset subsets of samples with different features, divided into three classes. The first class covers features of NL questions: (1) use of synonyms for existing database entities (e.g., in the question "What is the final station for the train 56701?", the term "final station" refers to the column name "destination"); (2) atypical size, i.e., "short" and "long" questions (those for which the ratio of the number of tokens in the query to the number of tokens in the question has an extreme value); (3) questions that contain several different logical structures at once ("How many students are over 18 and do not have allergy to food type or animal type?"); etc. The choice of these particular features is based notably on analysis of the results described in the evaluation sections of different papers (references can be found in Appendix C). The validity of this choice is confirmed by the low results obtained on the extracted suite in comparison with the scores obtained on the whole set of questions. On all functional test sets (excluding the "simple" and "extra simple" requests created for comparison), the average exact match accuracy is 0.23 (BRIDGE) and 0.27 (RAT-SQL), while on the full development set the results are 0.55 and 0.57, respectively. All metric values and the list of test set descriptions are presented in Tables 9, 10, 11, 12 and 13 in the Appendix.
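A minimal sketch of how one such subset ("long" questions) can be extracted follows; the sample format and the threshold are illustrative, not the ones actually used for PAUQ.

dev_set = [
    {"question": "How many singers do we have?",
     "query": "SELECT COUNT(*) FROM singer"},
    # ... the remaining development samples
]

def query_to_question_ratio(sample):
    return len(sample["query"].split()) / len(sample["question"].split())

# "Long" questions: the SQL is short relative to the question text.
long_questions = [s for s in dev_set if query_to_question_ratio(s) < 0.5]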

Experiments
In this paper, we provide an empirical evaluation of baseline neural network models for text-to-SQL on our dataset. To date, many different models exist, falling into three major categories (Katsogiannis-Meimarakis and Koutrika, 2021): (i) sequence-to-sequence, (ii) grammar-based, and (iii) sketch-based slot filling. We analyzed the Spider leaderboard (spi, 2022) and concluded that the sketch-based slot filling method does not stand up to the competition with the other two, which leaves us with the two remaining approaches. We kept the following requirements in mind during our selection: (1) the models should differ in the nature of their schema encoding, architecture, and decoding processes, so that we can test each method's performance on new datasets and languages; (2) since we also complement Spider with new database content and want to explore how this affects the resulting scores, the models must work with database content; (3) the models should use identical encoder models, to exclude language modeling bias. According to these requirements, we selected two popular Spider models: RAT-SQL (grammar-based) (Wang et al., 2021b) and BRIDGE (sequence-to-sequence) (Lin et al., 2020). RAT-SQL is an encoder-decoder framework that uses a relation-aware transformer within the encoder to model alignments between the database schema and content and the question tokens. The decoder of the model is tree-structured and generates an abstract syntax tree in the context-free SQL grammar. BRIDGE, in turn, feeds the database schema and content as input to the model. It has an encoder-decoder architecture with a pointer-generator network using beam search. The model generates queries in execution-guided order.
Within the experiments, we seek to answer the following research questions (RQs). RQ1: What is the performance difference between models trained on English, on various Russian datasets, or in a multilingual setup (English and Russian)? RQ2: Do we need qualified human translation when adapting the original dataset to a different language (in our case, Russian)? RQ3: How does query component prediction differ depending on the language?
The details on hyperparameters are presented in Appendix E. The mBERT-base language model is used as the encoder for all experiments on Russian data and in the multilingual setup. For the English models, we replaced the base encoder BERT-large with BERT-base to get comparable results. For the evaluation of the Russian models (PAUQ, MT, MT + HT) we used the PAUQ dev set, and for the English models, the original Spider dev set. For all sets, during training and testing, the revised and complemented databases from PAUQ were used. For metric calculation, we used the original Spider evaluation script provided at https://github.com/taoyds/spider; an assumed invocation is shown below. Since we are working with multilingual data, we explored how training simultaneously on PAUQ and Spider affects the evaluation metrics. The experiment was conducted as follows: we merged PAUQ with the original English Spider NL-SQL pairs and used the revised databases from PAUQ as the target databases. After training the models on this data, we measured the performance on the Spider development set and on PAUQ independently. Table 1 presents the obtained results. The answer to RQ1 is that the English model on BERT-base outperforms all Russian models on mBERT-base. However, training systems on the combined English and Russian dataset increases performance in both languages.
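The sketch below shows how the official evaluation script is typically invoked; the flag names follow our recollection of the script's command-line interface and should be verified against the repository.

import subprocess

subprocess.run([
    "python", "evaluation.py",
    "--gold", "dev_gold.sql",     # gold SQL queries, one per line
    "--pred", "predictions.sql",  # model predictions, same order
    "--db", "database/",          # directory with the (PAUQ-revised) SQLite files
    "--table", "tables.json",     # schema descriptions
    "--etype", "all",             # report both exact match and execution accuracy
], check=True)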

Human vs Machine Translation Comparison
To answer RQ2, we compared the performance of models trained on the manually translated PAUQ dataset with those trained on machine-translated (MT) or combined (MT + HT) data. The experiments showed that the MT + HT models perform on par with the models trained on PAUQ. Hence, in order to obtain a high-quality dataset in Russian, it is enough to resort to manual translation only for queries containing values. Comparing these results with those of the models trained on MT data, we see a decrease in execution accuracy, which corresponds to the weak value-matching ability of such models. This happens because the MT data does not correspond well to the database values. This observation is supported by the value match error rate on the training data: 0.69 for the PAUQ and MT + HT models versus 0.94 for MT.

Structure & Logic Understanding Error Analysis
To answer RQ3, we focus on component match errors related to structure and logic understanding. Such components correspond to the core logic of the question, projected onto the schema and the SQL syntax of the expected query. The selected components are WHERE (operations only), SELECT (aggregations only), GROUP (without HAVING), ORDER, AND/OR, IUEN, and JOIN. We use the following evaluation setup: for every predicted query in the development set, we extract the structure components that are incorrect according to the component set match metric. We then count all errors per component and scale them by the total number of occurrences of each component in the development set to get the distribution of errors per component; a sketch of this computation follows below. We analyze and compare the predictions of the MT + HT, PAUQ, RU + ENG, and Spider models. The MT + HT, PAUQ, and RU + ENG models are evaluated on the PAUQ development set, and the Spider models on the original development set. These metrics are presented in Table 3.
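The following sketch shows the per-component error distribution; the record format is an assumption, since in practice the per-component verdicts come from the partial (component set match) scores of the Spider evaluation script.

from collections import Counter

COMPONENTS = ["where", "select", "group", "order", "and/or", "IUEN", "join"]

def error_distribution(records):
    # records: [{"components": [... components present in the gold query ...],
    #            "correct":    {component: bool, ...}}, ...]
    errors, totals = Counter(), Counter()
    for rec in records:
        for comp in rec["components"]:
            totals[comp] += 1
            if not rec["correct"].get(comp, False):
                errors[comp] += 1
    return {c: errors[c] / totals[c] for c in COMPONENTS if totals[c]}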
As we can see, models for both languages most often make incorrect predictions on the JOIN, ORDER and GROUP components. The English models, however, seem to perform better on these components, while on the other components (SELECT, WHERE, AND/OR, IUEN) the difference between the models is not drastic.
We also explored how the structure and logic errors of the models on Russian and English data differ depending on the complexity of the Spider query split (Table 2). An analysis of how the predicted structure & logic component errors on the PAUQ development set intersect with those on the Spider development set reveals that the more complex the queries are, the more the structure & logic component errors of models trained on different languages intersect with each other.

Schema Linking Error Analysis
Along with the analysis of semantic errors, we evaluate how the models trained on Russian queries cope with substituting the necessary database schema elements into the query.
To evaluate this, we extracted the gold and predicted database schema elements from the queries generated by BRIDGE and RAT-SQL and calculated the error rate for each element. These elements refer to the query components SELECT (without aggregations), WHERE (value components), FROM, GROUP BY, and ORDER BY.
Table 4 shows the error rates for particular components in the BRIDGE and RAT-SQL predictions on the development subsets of each of the four datasets (Spider, MT + HT, PAUQ, EN + RU). The percentage is calculated relative to the total number of queries in the development subset containing the given component. Our study shows that both models have a very similar distribution of schema linking errors. The average number of errors related to entity linking is much higher than the average number of errors related to structure and logic understanding. The English models achieve better results on schema linking; apparently, the mBERT encoder more often fails at linking entities. All models perform well on predicting database table names (the FROM component, 8% errors). The models show lower results on column name prediction (the SELECT, WHERE (columns), ORDER BY, and GROUP BY components): 23% errors for the Russian datasets. Predictions made on the English set are better in terms of column name linking, containing only 18% such errors.

Value Match
The most notable difference relates to the WHERE (values) component: it is the most difficult part for all English and Russian models. However, both models trained on PAUQ, MT + HT, and RU + ENG make mistakes twice as often as those trained on Spider. Moreover, the predictions of the Russian MT model fail to link value entities at all: since the translation is made automatically, such entities are often absent from the database. That is one of the main reasons for using a dataset created by annotators instead of less labour-intensive machine translation. The BRIDGE model trained on English Spider achieves better results on database value matching than RAT-SQL, because it augments the model input with automatically extracted database cell values mentioned in the question, aligning the schema components with the NL utterance using a fuzzy match algorithm. However, since Russian is a highly inflectional language, fuzzy matching works correctly in far fewer cases than in English; the snippet below illustrates the effect. Thus, for Russian, both models perform rather poorly on value matching.
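The toy comparison below, using Python's difflib as a stand-in for BRIDGE's own fuzzy matcher, shows why surface-level matching degrades on an inflectional language: the Russian value changes its surface form with grammatical case, while the English one does not.

import difflib

def best_token_ratio(value, question):
    return max(
        difflib.SequenceMatcher(None, value.lower(), tok.lower()).ratio()
        for tok in question.split()
    )

print(best_token_ratio("Moscow", "students who live in Moscow"))  # 1.0
print(best_token_ratio("Москва", "студенты живущие в Москве"))    # ~0.83, "Москве" != "Москва"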

Conclusion
In this paper, we have presented the first public text-to-SQL dataset for the Russian language.

Limitations
As our dataset is an adaptation of the Spider dataset to the Russian language, it indeed inherits most of Spider's limitations. First of all, the data is still 'artificial', meaning that it was created by a limited number of people specifically for training and evaluating text-to-SQL models; thus it lacks the diversity and complexity of natural data formed by the questions people actually formulate in order to get the desired information from a database. For instance, real-world data contains NL queries that require common sense knowledge that cannot be extracted directly from the database; ambiguous questions allowing various interpretations, which are quite frequent; and queries with window functions, which make the process easier and more convenient. None of these are included in the Spider dataset, or in ours. Some of these and other limitations have already been resolved in more recent datasets (Yu et al., 2019; Hazoom et al., 2021); others we partially address with our functional test sets.
Another limitation concerns the evaluation metrics: exact match and execution accuracy, which are the most commonly used for evaluating text-to-SQL model performance. The first is too strict and prone to false negatives (Zhong et al., 2020), while the latter is problematic with respect to spurious and ambiguous questions (Hazoom et al., 2021). More sophisticated metrics, such as those proposed in (Kim et al., 2020; Hazoom et al., 2021), may be used in future work to evaluate model performance more adequately.

Ethical Considerations
The presented dataset has been collected in a manner consistent with the terms of use of the original Spider, which is distributed under the CC BY-SA 4.0 license. We also used the original evaluation code scripts from the Spider repository. The translation of queries from English to Russian was made by a professional translator; the database changes were made by annotators. All of them received fair compensation (more than the minimum wage in Moscow, Russia). We would like to thank the authors of Spider for providing access to the original data. We also thank the translator and annotators for their time and effort.

A Principles of values localization
• Abbreviations: if there is a widely used Russian-language analogue of the abbreviation, it is translated and changed; if an unknown abbreviation is found, it is kept in its original form, without transliteration.
• Names of films/songs/metrics/events/technology, etc.: such names are localized, i.e., replaced with an analogue known in Russian culture (familiar to a native speaker from the media/culture); however, a name is kept in English if it is used in the media in its original form.
• Proper names: personal proper names are substituted with Russian-language analogues; company/brand names are not translated.
• Addresses: addresses without a specified country are localized.
• Binary values (e.g., "yes/no", "true/false"): such values are translated into Russian.

B Comparative Analysis
B.1 Database Content

B.1.1 Number of Entities
To quantify the scope of the newly added values, we counted the number of distinct database elements.
Since the translation affected only the values in the tables, the number of databases, tables and columns has not changed. The total number of values increased by 2%.

B.1.2 Table Sizes
In terms of diversity, the variety of column sizes increased from 122 unique values to 151. Fig. 3 shows the distribution of column sizes for "small" tables, i.e., tables shorter than 25 rows (the biggest table contains 510,437 rows).
An outlier at 15 is the result of automatic table generation.

B.1.4 Overlapping Entities
Almost every fourth Spider entity (table name, column name, or column value) contains a token from some other element, but only an eighth of these are used in any query. While adding Russian values and new requests, we tried to increase both indicators: the quantity of overlapping tokens and the number of common tokens in questions, which can cause ambiguity problems for text-to-SQL models.
For a precise analysis, we calculated the number of overlapping tokens found in several entities in Spider and PAUQ; a sketch of the computation follows below. The results are presented in Table 5.
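A simplified version of the statistic: tokens occurring in more than one entity (table name, column name, or cell value) count as overlapping. The whitespace/underscore tokenization here is a naive stand-in for the one actually used.

from collections import defaultdict

def overlapping_tokens(entities):
    owners = defaultdict(set)
    for entity in entities:
        for tok in entity.lower().replace("_", " ").split():
            owners[tok].add(entity)
    return {tok for tok, ents in owners.items() if len(ents) > 1}

# e.g., entities resembling the "mountain_photos" case from Fig. 1:
print(overlapping_tokens(["mountain", "mountain_photos", "name", "photo_name"]))
# -> {'mountain', 'name'}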

• The longest question:
"Display the employee number, name (first name and last name), and salary for all employees who earn more than the average salary and who work in a department with any employee with a 'J' in their first name."(45 tokens, DB "movie_1", is the same for both languages).

B.3 Coverage by Mentions
An important advantage of Spider is its large number of multi-domain databases of different sizes. But not all DB elements are addressed in the requests: 92.0% of all tables are used, 48.3% of columns, and less than 0.1% of all values (see Fig. 4).
To PAUQ, we add new mentions of 15 tables that are not used in Spider.

B.4 Request Template Words
Many queries contain standard request template words, such as various imperative verbs with or without an affirmative "please". Some examples: • Please, show the most common type of ships.
• Split the number of killed ships by type.
This set is not very diverse, but it is slightly expanded in PAUQ. It is given in Table 6.

B.5.1 Corrections
During the translation process we "repaired" more than 10 Spider samples. The problems were mostly connected with ambiguity in the questions and logical patterns. E.g., in the query corresponding to the question "What are the names of all reviewers that have given 3 or 4 stars for reviews?", a UNION operator should be used instead of INTERSECT in the Spider ground truth; a hedged sketch of the fix is shown below. The list of corrections is available in the repository.
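The sketch below illustrates the correction; the table and column names follow our recollection of the Spider "movie_1" schema (Reviewer(rID, name), Rating(rID, mID, stars, ...)) and should be checked against the actual database.

wrong_gold = """
SELECT T1.name FROM reviewer AS T1 JOIN rating AS T2 ON T1.rID = T2.rID
WHERE T2.stars = 3
INTERSECT
SELECT T1.name FROM reviewer AS T1 JOIN rating AS T2 ON T1.rID = T2.rID
WHERE T2.stars = 4
"""
# "3 or 4 stars" is a disjunction, so the set operator must be UNION:
corrected_gold = wrong_gold.replace("INTERSECT", "UNION")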

B.6 SQL query sets imbalances
In the Spider queries we found several quantitative imbalances: (1) one of the constructions is found one and a half times more often than the next variant (more details in Appendix B.5); (2) the queries are dominated by the aggregation function COUNT, which accounts for more than half of all aggregation usages.
C.4 Fuzzy Mentions

→ the column "name" from the table "mountain" (DB "mountain_photos").
• Adjective form: "The number of matches in which the loser was higher than the winner."
→ SELECT COUNT(*) FROM matches WHERE loser_ht > winner_ht (DB "wta_1").
• Partial match: only part of the tokens from the original entity name are used in the question. "Location of Webber University"
→ the value "Webber International University" (DB "protein_institute"). In the mentioned column "Institution" there are other values with the keywords "Webber" and "University".
• Overlapping (an important subtype of the previous type): the use of multiple intersecting mentions of several entities. "ID of race and driver for the first position" (in Russian: "Идентификатор заезда и водителя у первого места")
→ the columns "raceId" and "driverId" (DB "formula_1"). In this table there are other columns with the keywords "race" and "driver", so it is important for a model to understand that the word "ID" is connected not only with the word "race" but also with the word "driver".

C.5 Empty Return
As mentioned above, all queries were redesigned so that the result of their execution is a non-empty set of cells. To prevent imbalance in the dataset, we add a separate pool of samples whose conditions do not correspond to any row of the database. This set of queries can easily be excluded from the main dataset.
These are: • requests with empty returns; • zero-result requests with COUNT aggregation; • requests with AVG (average) aggregation that refer to an empty set of rows and thus produce NULL (often surfaced as NaN) as part of the return.
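A sqlite3 sketch of the three flavours over a toy table whose condition matches no row:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ships (name TEXT, killed INTEGER)")
cur.execute("INSERT INTO ships VALUES ('Bismarck', 50)")
cond = "killed > 100"  # matches no row
print(cur.execute(f"SELECT name FROM ships WHERE {cond}").fetchall())      # []
print(cur.execute(f"SELECT COUNT(*) FROM ships WHERE {cond}").fetchone())  # (0,)
print(cur.execute(f"SELECT AVG(killed) FROM ships WHERE {cond}").fetchone())
# (None,) -- SQL NULL, which some clients surface as NaN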

D Functional Test Sets
The descriptions and sizes of the functional test sets can be found below (Tables 9, 10, 11, 12, 13). All tables present the exact match accuracy of the BRIDGE (values on the left) and RAT-SQL (values on the right) base models, trained on PAUQ without the new samples described in Appendix C.

Figure 1: Example of a database containing different entities with the same names. To translate the questions in which they are mentioned, it is necessary to consider the location of the correct DB content and the context of the request.

Figure 4: Coverage of database entities by mentions in questions. The left bar in every category corresponds to Spider, the right bar to PAUQ. Dark color marks the entities in the databases, light color the entities used in the requests.

Figure 5: Example of date and time values in Spider queries.

• Database features: (1) questions with entities that cause the ambiguity problem for the models (e.g., the request "Name of user with id 26" mentions the column "Name" from the table "Employers", while there is a column with the same title in the table "Workers"); (2) queries with binary filters like 0/1 or "yes"/"no" (e.g., the request "Number of cancelled flights" filters all flights that have the value "No" in the column "Completed"); (3) mentions of multi-sentence values in the request; etc. Other examples can be found in Table 9 in Appendix C.

Table 2: Component errors intersection for BRIDGE (left) and RAT-SQL (right).

Table 3: Component error rate per structure and logic component for BRIDGE (left) and RAT-SQL (right).

Table 4: Schema linking error rate per database schema component for BRIDGE (left) and RAT-SQL (right).

Table 9: List of test sets for database features. The words defining the features are highlighted in color.

Most often, the logical connective linking the names of entities in the question coincides with the logical link in the request, so it is really hard to determine where it should be replaced with the opposite one; this demands a deep understanding of semantics.

Table 10: List of test sets for question logical features.