Chase: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL

The cross-database context-dependent Text-to-SQL (XDTS) problem has attracted considerable attention in recent years due to its wide range of potential applications. However, we identify two biases in existing datasets for XDTS: (1) a high proportion of context-independent questions and (2) a high proportion of easy SQL queries. These biases conceal the major challenges in XDTS to some extent. In this work, we present Chase, a large-scale and pragmatic Chinese dataset for XDTS. It consists of 5,459 coherent question sequences (17,940 questions with their SQL queries annotated) over 280 databases, in which only 35% of questions are context-independent, and 28% of SQL queries are easy. We experiment on Chase with three state-of-the-art XDTS approaches. The best approach only achieves an exact match accuracy of 40% over all questions and 16% over all question sequences, indicating that Chase highlights the challenging problems of XDTS. We believe that Chase can provide fertile soil for addressing these problems.


Introduction
The problem of mapping a natural language utterance into an executable SQL query in the cross-database and context-dependent setting has attracted considerable attention due to its wide range of applications (Wang et al., 2020b; Zhong et al., 2020). This problem is notoriously challenging due to the complex contextual dependencies among questions in a sequence. Consider the question sequence in Figure 1. In order to understand the last question, one needs to figure out the elliptical object of the verb "培养 (have)" from the first two questions in the sequence, which is "状元球员 (first pick player)". Questions like this are context-dependent, since they require resolutions of contextual dependencies such as the ellipsis in this question. There are also context-independent questions that can be understood individually, such as the first question in Figure 1. For ease of reference, we refer to this cross-database context-dependent Text-to-SQL problem as XDTS. To study the challenges in XDTS, a continuous effort has been dedicated to constructing datasets, including SParC (Yu et al., 2019a) and CoSQL (Yu et al., 2019b).
However, through a careful analysis of existing datasets, we identify two biases in them, and these biases conceal the major challenges in XDTS to some extent. First, there are only a limited number of context-dependent questions in existing datasets. Specifically, only 32% of questions in CoSQL are context-dependent, and only 66% of question sequences have context-dependent questions. SParC has more context-dependent questions than CoSQL, but 48% of its questions are still context-independent. Such a limited number of context-dependent questions is unexpected, because prior work (Bertomeu et al., 2006) has shown that questions within a database dialogue are highly likely to be context-dependent, and how to effectively model the context to understand a context-dependent question is one of the major challenges in XDTS. Second, 40% of SQL queries in both SParC and CoSQL are particularly easy, involving at most one condition expression. This biased distribution of SQL queries is potentially caused by their construction methods. In fact, we find that SQL queries for question sequences created from scratch are much more challenging.
Upon identifying the limitations of existing datasets, we present CHASE, a large-scale and pragmatic Chinese dataset for XDTS. CHASE consists of 5,459 question sequences (17,940 questions with their SQL queries annotated) over 280 multi-table relational databases. Compared with SParC and CoSQL, the proportion of context-independent questions in CHASE is reduced from 48% and 68% to 35%, and the proportion of easy SQL queries is reduced from 40% and 41% to 28%. Moreover, CHASE has richer semantic annotations, including the contextual dependency and schema linking (Lei et al., 2020) of each question. CHASE is also the first Chinese dataset for XDTS. CHASE is made up of two parts: CHASE-C and CHASE-T. For CHASE-C, we recruit 12 Chinese college students who are proficient in SQL to create question sequences from scratch and annotate corresponding SQL queries. To ensure the diversity and cohesion of question sequences, we propose an intent recommendation method. When a student is going to raise a question, an intent category is randomly sampled with the method, and the student is recommended to write the question and SQL query according to it. For CHASE-T, inspired by the construction of CSpider (Min et al., 2019), we translate all the questions, SQL queries, and databases in SParC from English to Chinese. We also try our best to mitigate the biases in SParC.
To understand the characteristics of CHASE, we conduct a detailed data analysis and experiment with three state-of-the-art (SOTA) XDTS approaches, namely, EditSQL (Zhang et al., 2019), IGSQL (Cai and Wan, 2020), and our extension of RAT-SQL (Wang et al., 2020a). The best approach only achieves an exact match accuracy of 40% over all questions and 16% over all question sequences, indicating that CHASE presents significant challenges for future research. The dataset, benchmark approaches, and our annotation tools are available at https://xjtu-intsoft.github.io/chase.
In summary, this paper makes the following main contributions: • We identify two biases in existing datasets for XDTS: (1) a high proportion of context-independent questions and (2) a high proportion of easy SQL queries.
• We propose an intent recommendation method to guide the question sequence creation. The analysis on CHASE shows that our method is useful to enrich the diversity and cohesion of question sequences.
• CHASE, to the best of our knowledge, is the first large-scale and pragmatic Chinese dataset for XDTS. Experimental results on CHASE with three state-of-the-art approaches show that there is still a long way to go in solving the challenging problems of XDTS.

Study of Existing Datasets
In this section, we first formally define the problem of XDTS and its evaluation metrics. Then, we present our study to understand the limitations and biases of existing datasets in Contextual Dependency and SQL Hardness Distribution.

Definition of XDTS
Let $Q_i = \langle q_{i,1}, \cdots, q_{i,n} \rangle$ and $Y_i = \langle y_{i,1}, \cdots, y_{i,n} \rangle$ denote a question sequence and its SQL queries, where $q_{i,j}$ is the $j$-th question in $Q_i$ and $y_{i,j}$ is the corresponding SQL query for $q_{i,j}$. Given a database $DB_i$, a question $q_{i,j}$, and the question's context $q_{i,1}, \cdots, q_{i,j-1}$, the goal of XDTS is to generate the SQL query $y_{i,j}$ for $q_{i,j}$. An XDTS dataset is a set of question sequences with their SQL queries and databases. Two metrics are widely used to evaluate the prediction accuracy for XDTS: Question Match and Interaction Match. Question Match is 1 when the predicted SQL query of $q_{i,j}$ matches $y_{i,j}$. Interaction Match is 1 when all predicted SQL queries of $Q_i$ match $Y_i$.
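The two metrics can be sketched as follows; this is an illustrative implementation, not the official evaluator, which compares normalized SQL structures rather than raw strings.

```python
from typing import List

def question_match(pred: List[str], gold: List[str]) -> float:
    """Question Match (QM): fraction of questions whose predicted SQL
    matches the gold SQL. Here matching is plain string equality for
    illustration; real evaluators use exact set match over SQL components."""
    assert len(pred) == len(gold)
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def interaction_match(pred_seqs: List[List[str]], gold_seqs: List[List[str]]) -> float:
    """Interaction Match (IM): fraction of question sequences in which
    every predicted SQL query matches its gold query."""
    assert len(pred_seqs) == len(gold_seqs)
    return sum(p == g for p, g in zip(pred_seqs, gold_seqs)) / len(gold_seqs)
```

Note that IM is strictly harder than QM: one wrong prediction zeroes out the whole sequence, which explains why IM scores in the paper are far below QM scores.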

Study Setup
Dataset There are two datasets for studying XDTS, both of which are English corpora.
(1) SParC (Yu et al., 2019a) SParC is the first dataset for XDTS. It is constructed upon the Spider dataset (Yu et al., 2018). Given a pair of question and SQL query chosen from Spider, an annotator was asked to write a sequence of questions to achieve the goal specified in the chosen pair.
(2) CoSQL (Yu et al., 2019b) CoSQL is a corpus for task-oriented dialogue. It uses SQL queries for dialogue state tracking. Hence, it is also used to study XDTS. Question sequences in CoSQL were collected under the Wizard-of-Oz setup (Kelley, 1984). An annotator was assigned a pair of question and SQL query chosen from Spider, and she was asked to raise interrelated questions towards the goal specified in the pair. Another annotator wrote the SQL query for each question if it was answerable.
Benchmark Approach We consider three SOTA approaches as our benchmark approaches to understand the characteristics of existing datasets: EditSQL (Zhang et al., 2019), IGSQL (Cai and Wan, 2020), and RAT-CON. RAT-CON is our extension of RAT-SQL (Wang et al., 2020a), which is the SOTA approach for the context-independent Text-to-SQL problem. Appendix A.1 provides the details of our extension. All three approaches utilize BERT (Devlin et al., 2019) for encoding.

Contextual Dependency
Prior work (Bertomeu et al., 2006) on database question answering dialogues reveals that questions within a dialogue tend to be context-dependent, i.e., the meaning of a question cannot be understood without its context. The last two questions in Figure 1 are typical context-dependent questions, requiring resolutions of ellipsis. In fact, how to effectively model the context to understand a context-dependent question is one of the major challenges in XDTS (Liu et al., 2020). Hence, we study this characteristic of existing datasets to understand how pragmatic and challenging they are.
To measure the contextual dependency of an XDTS dataset, we manually classify all the questions in its development set into context-dependent and context-independent. If a question is context-dependent, we further label whether it has coreference or ellipsis, which are two frequently observed linguistic phenomena in dialogues (Androutsopoulos et al., 1995). Note that a question can have both coreference and ellipsis. Each question is first classified by one author of this paper, and then cross-checked and corrected by another. As shown in Table 1, there are only a limited number of context-dependent questions in existing datasets. Specifically, only 32% of questions in CoSQL are context-dependent, and the remaining 68% of questions can be understood without the context. Among the 293 question sequences in the development set of CoSQL, 34% of them do not have any context-dependent question. Table 15 in the Appendix provides a set of CoSQL question sequences and our classification results. Compared with CoSQL, SParC has more context-dependent questions and more questions that require resolutions of coreference and ellipsis. Nevertheless, 48% of its questions are still context-independent. Table 2 shows the Question Match (QM) and Interaction Match (IM) of our benchmark approaches on SParC and CoSQL. The QM on context-dependent questions is substantially lower than that on context-independent ones, showing that it is challenging for SOTA approaches to generate SQL queries for context-dependent questions. In view of this challenge and the limited number of context-dependent questions in existing datasets, it is necessary to construct a more pragmatic dataset, involving more context-dependent questions, for studying XDTS.

SQL Hardness Distribution
SQL hardness is defined as a four-level complexity for SQL queries: easy, medium, hard, and extra hard, according to the number of components, selections, and conditions in a SQL query (Yu et al., 2018). The more components a SQL query has, the more complex it is. Intuitively, the more hard and extra hard SQL queries a dataset has, the more challenging the dataset is. Table 3 presents the SQL hardness distribution in the development sets of SParC and CoSQL. We can observe a biased distribution in both datasets, i.e., more than 40% of SQL queries are easy. This biased distribution is potentially caused by their construction methods. Take SParC as an example. A question sequence is constructed by decomposing a complex SQL query into multiple thematically related ones. Although this method is cost-effective, there is little chance that a SQL query is more complicated than the one that it is decomposed from. As we will show in Section 4.3, the SQL hardness distribution of question sequences created from scratch differs a lot from those created via decomposition.
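The spirit of the hardness heuristic can be sketched as follows. This is a rough keyword-counting approximation for illustration; the official Spider implementation inspects the parsed SQL tree and uses more detailed criteria.

```python
def sql_hardness(sql: str) -> str:
    """Rough approximation of Spider's four-level SQL hardness: count
    complexity-bearing components and bucket the query. Thresholds here
    are illustrative, not the official ones."""
    s = sql.lower()
    components = ["join", "group by", "order by", "having",
                  "intersect", "union", "except", "like", "limit"]
    score = sum(s.count(k) for k in components)
    score += s.count(" and ") + s.count(" or ")  # extra conditions
    score += s.count("select") - 1               # nested/extra SELECTs
    if score <= 1:
        return "easy"
    if score <= 2:
        return "medium"
    if score <= 4:
        return "hard"
    return "extra hard"
```

Under this sketch, a bare single-table lookup lands in "easy", while a query combining joins, grouping, ordering, and limits accumulates enough components to reach "hard" or beyond.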

Dataset Construction
Given the limitations of existing datasets, we present CHASE, a large-scale and pragmatic Chinese dataset for XDTS. Unlike the construction of SParC and CoSQL, we do not specify a final goal for each question sequence. Instead, we motivate our annotators to raise diverse and coherent questions via an intent recommendation method. Based on this method, we collect a set of relational databases, and we recruit annotators to create question sequences from scratch and annotate corresponding SQL queries. Data collected in this way are referred to as CHASE-C.
Besides, inspired by the construction of CSpider (Min et al., 2019) and Vietnamese Spider (Tuan Nguyen et al., 2020), we translate all the questions, SQL queries, and databases in SParC from English to Chinese. During translation, we also try our best to mitigate the biases in SParC. Data collected with this method are referred to as CHASE-T. CHASE is made up of both CHASE-C and CHASE-T.
Since all existing datasets for XDTS are constructed for English, prior work on this problem primarily focuses on English, leaving other languages underexplored. To enrich the language diversity, in this paper, we construct CHASE for Chinese, and we leave the support of more languages as our important future work.

Intent Recommendation
In XDTS, the intent of a question $q_{i,j}$ is fully reflected by its SQL query $y_{i,j}$. Hence, by defining a rich set of relations between $y_{i,j-1}$ and $y_{i,j}$, we can derive diverse $y_{i,j}$ based on $y_{i,j-1}$. Consequently, we can motivate annotators to raise questions with diverse intents. We define four basic intent categories of relations between $y_{i,j-1}$ and $y_{i,j}$:
(1) Same Instances. $y_{i,j}$ focuses on the other properties of the instances queried in $y_{i,j-1}$, e.g., by replacing columns in the SELECT clause of $y_{i,j-1}$.
(2) Different Instances of the Same Entity. $y_{i,j}$ queries the same type of entity and properties as $y_{i,j-1}$, but it focuses on different instances, e.g., by adding an extra condition in the WHERE clause.
(3) Different Entity. $y_{i,j}$ queries a different type of entity than $y_{i,j-1}$, e.g., by altering the tables in the FROM clause of $y_{i,j-1}$.
(4) Display. $y_{i,j}$ alters the way to display the information queried in $y_{i,j-1}$, e.g., by adding an ORDER BY clause or DISTINCT in the SELECT clause.
We define 16 relations in these four categories, and we also allow combinations of them. Due to space limitations, we only present 8 relations with their examples in Table 4. The complete set of relations is available in Table 12 in the Appendix.
When an annotator is going to raise a follow-up question, one of the five intent categories in Table 4 (the four basic categories plus Combination) will be randomly selected. The annotator is then recommended to choose a relation belonging to the selected category and raise the question according to that relation. The annotator is also allowed to change the intent category when it is not applicable or she has a better choice. With this intent recommendation method, follow-up questions are closely related to their previous questions and present rich intent diversity.

Table 4: A subset of relations between precedent SQL query $y_{i,j-1}$ and current SQL query $y_{i,j}$.
• Different Instances of the Same Entity, R4. Overlap: select name from student join student_course where course_name = "Python"; → select name from student join student_course where course_name = "C++";
• Different Entity, R5. Change Entity: select name from student; → select course_name from course;
• Display, R6. Add Order: select country, count(*) from student group by country; → select country, count(*) from student group by country order by count(*);
• Display, R7. Distinct: select country from student; → select distinct country from student;
• Combination, R8. Add Property (R1) & Subset (R3): select name from student; → select name, age from student where country = "US";
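The recommendation step can be sketched as below. The category and relation names follow Table 4; the data structure and the policy of also sampling a relation are illustrative assumptions, since in the actual tool the annotator picks the relation herself.

```python
import random

# Intent categories mapped to a few of their relations (a subset of the
# 16 relations defined in the paper; names follow Table 4).
RELATIONS = {
    "Same Instances": ["Add Property"],
    "Different Instances of the Same Entity": ["Subset", "Overlap"],
    "Different Entity": ["Change Entity"],
    "Display": ["Add Order", "Distinct"],
    "Combination": ["Add Property & Subset"],
}

def recommend_intent(rng: random.Random):
    """Randomly sample an intent category, then a relation within it.
    (In the real annotation tool, only the category is sampled and the
    annotator chooses the relation.)"""
    category = rng.choice(sorted(RELATIONS))
    relation = rng.choice(RELATIONS[category])
    return category, relation
```

Because the category is sampled uniformly at each turn, consecutive questions are steered toward different kinds of follow-ups, which is what enriches the intent diversity of a sequence.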

Construction of CHASE-C
Data in CHASE-C are collected in three stages: (1) database collection; (2) question sequence creation; and (3) data review.

Database Collection
We collect 120 Chinese multi-table relational databases from the DuSQL dataset (Wang et al., 2020c). There are 200 databases and 813 tables in DuSQL, but most of the tables are crawled from encyclopedias and forums. Hence, there is a lot of missing entries and noise (e.g., duplicated or conflicting columns, tables in a database describing unrelated topics, and missing foreign key constraints).
To obtain high-quality databases, we manually revise all the databases, dropping those without related tables, resolving duplicated or conflicted columns, and complementing missing entries. As a result, we collect 120 high-quality databases, covering 60 different domains such as Sport, Education, and Entertainment.

Question Sequence Creation
We recruit 12 Chinese college students who are proficient in SQL to create question sequences for databases from scratch. They are also asked to write the SQL query for each question. When a student starts a question sequence creation session, she is shown all the contents of a database, and she can get familiar with the database by executing arbitrary SQL queries. Once she is ready, she receives a specification of the minimum number of questions in the sequence. She can raise the first question based on her own interests. Take the creation of the question sequence in Figure 1 as an example.
The student asks the first question "哪所大学培养了最多MVP球员？(Which university has the most MVP players?)" and writes its corresponding SQL query. The execution results of the SQL query are shown to the student, helping her raise the follow-up question. After that, she receives the intent category Different Instances of the Same Entity, which is randomly sampled by our annotation tool. She chooses the Overlap relation in this category and raises the second question "状元呢？(How about the first overall pick?)". This creation session continues until the minimum number of questions is reached.
To help study the characteristics of questions and address the schema linking challenge (Guo et al., 2019b; Lei et al., 2020) in Text-to-SQL, we also ask the students to label each question's contextual dependency as in Section 2.3 and the linking between database schema items (tables and columns in databases) and their mentions in questions.

Data Review
To ensure the data quality, we conduct two rounds of data review. First, when a student creates her first 20 question sequences, we carefully review all the annotations to check whether the questions in each sequence are thematically related and whether the semantics of SQL queries match their questions. If not, we run a new round of training for the student. Through this round of review, we can resolve misunderstandings of the annotation guidelines as early as possible. After the question sequence creation stage is finished, we review all the question sequences as in the first round, and we ask the students to modify their annotations if there are any problems.

Construction of CHASE-T
The original SParC dataset consists of 4,298 question sequences and 200 databases, but only 3,456 and 160 of them are publicly available for training and development. Hence, we could only translate those to construct CHASE-T. The translation work is performed by 11 college students, 10 of whom also participate in the question sequence creation stage of CHASE-C. Each database and all its question sequences are translated by one student. The student also needs to label each question's contextual dependency and the linking between schema items and their mentions in the translated questions. We encourage the student to translate a question based on its semantics to obtain the most natural question in Chinese.
To mitigate the biases in SParC, we ask our students to modify those context-independent or thematically unrelated questions and SQL queries to make the question sequences more coherent and natural. Our intent recommendation method is also applied to guide the modification. To ensure the data quality, we also run a two-round data review as in Section 3.2.3.
During the construction of CHASE-T, we identified and fixed 150 incorrect SQL queries in SParC. Also, we modified 1,470 SQL queries to make the question sequences in CHASE-T more coherent.

Data Statistics and Analysis
We compute the statistics of CHASE and conduct a thorough analysis to understand its three characteristics: contextual dependency, SQL hardness distribution, and mention of database schema items. CHASE consists of 5,459 question sequences (17,940 questions with their corresponding SQL queries annotated) over 280 databases. CHASE-C contributes 37% of the question sequences and 43% of the question-SQL pairs; CHASE-T contributes the rest. CHASE is the largest dataset for XDTS to date, consisting of the most question sequences, SQL queries, and databases. CHASE also has rich semantic annotations, including contextual dependency and schema linking, which can inspire innovations to address challenges in XDTS. Table 16 in the Appendix provides a list of question sequences in CHASE.

Data Statistics
Data Split According to the cross-database setting of XDTS, we split CHASE such that a database appears in only one of the train, development, and test sets. To understand the characteristics of the data collected in CHASE-C and CHASE-T, we also split them accordingly. Since CHASE-T is constructed from SParC, we follow the train and development split of the original SParC dataset. Table 6 shows the data split statistics. Table 3 shows the SQL hardness distribution of CHASE. SQL queries at different hardness levels are more evenly distributed in CHASE, and only 28% of them are easy. Comparing CHASE-C with existing datasets, we can observe a remarkable difference between their hardness distributions. Specifically, the proportion of easy queries (19%) in CHASE-C is lower than that of hard (24%) and extra hard (20%) queries, indicating that question sequences created from scratch with our method are much more challenging. In terms of CHASE-T, the proportion of easy queries decreases from 40% in SParC to 37% through our effort.
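The cross-database splitting constraint can be sketched as follows: databases, not sequences, are partitioned, so no database is shared between splits. The ratios and shuffling policy here are illustrative assumptions, not the exact split used by CHASE.

```python
import random
from collections import defaultdict

def split_by_database(sequences, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split question sequences so that each database appears in exactly
    one of train/dev/test, as the cross-database setting requires.
    `sequences` is a list of (database_id, sequence) pairs; `ratios`
    apply to databases, not sequences."""
    by_db = defaultdict(list)
    for db, seq in sequences:
        by_db[db].append(seq)
    dbs = sorted(by_db)
    random.Random(seed).shuffle(dbs)
    n = len(dbs)
    cut1 = int(n * ratios[0])
    cut2 = int(n * (ratios[0] + ratios[1]))
    groups = {"train": dbs[:cut1], "dev": dbs[cut1:cut2], "test": dbs[cut2:]}
    return {name: [s for db in group for s in by_db[db]]
            for name, group in groups.items()}
```

A model evaluated under such a split never sees the test databases' schemas during training, which is exactly what makes the cross-database setting harder than the in-database one.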

Mention of Database Schema Items
To understand how database schema items (tables and columns) are mentioned in questions, for each item annotated in the schema linking, we examine whether or not it can exactly match its mention in the question (Suhr et al., 2020). As shown in Table 7, among the 26,464 items annotated in the schema linking of CHASE, 48% of them are exactly mentioned in questions (Exact String Match), and 40% of them have at least one token that appears in their mentions (Fuzzy String Match). The remaining 12% of items cannot be matched with their mentions via any string-match based method (Semantic Match). Table 8 presents four typical examples of fuzzy string match and semantic match. Compared with CHASE-T, whose data are constructed from SParC, CHASE-C has more items in the fuzzy string match and semantic match groups, implying that CHASE-C is more challenging and its mentions of schema items are more diverse.
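The three-way analysis can be sketched as below. The tokenization choices (whitespace/underscore tokens for English, characters for Chinese) are simplifying assumptions for illustration; "semantic" here simply means no string overlap at all.

```python
def link_type(schema_item: str, mention: str) -> str:
    """Classify how a schema item relates to its annotated mention:
    'exact' for an exact string match, 'fuzzy' when at least one token
    is shared, otherwise 'semantic' (no string-based match possible)."""
    item, m = schema_item.lower(), mention.lower()
    if item == m:
        return "exact"

    def tokens(s: str) -> set:
        # Whitespace/underscore tokens for ASCII text; fall back to
        # characters for languages without whitespace, such as Chinese.
        parts = s.replace("_", " ").split()
        return set(parts) if len(parts) > 1 or s.isascii() else set(s)

    if tokens(item) & tokens(m):
        return "fuzzy"
    return "semantic"
```

The semantic-match group is the one that defeats string-match based linkers such as the one in RAT-SQL, which is why datasets with a larger semantic-match share are harder for those approaches.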

Experiments
To understand the performance of the SOTA approaches on CHASE, CHASE-C, and CHASE-T, we experiment with the three approaches introduced in Section 2.2. Appendix A.3 provides the details of our adaptations for Chinese inputs and the experimental setup. Table 9 presents the experimental results, from which we make four main observations. First, the performance of the SOTA approaches on CHASE is far from satisfactory. The best approach on CHASE, IGSQL, only achieves 40.4% Question Match (QM), which is significantly lower than the SOTA QM on SParC (60.1%) and CoSQL (50.8%). In terms of Interaction Match (IM), the best approach on CHASE only achieves 15.6%, lagging behind the SOTA IM on SParC (38.1%) and CoSQL (20.1%) by a large margin. These results show that CHASE presents significant challenges for future research on XDTS.

Experimental Results
Second, the performance of the SOTA approaches on CHASE-C is lower than that on CHASE-T. Specifically, IGSQL can achieve 43.3% QM and 26.3% IM on CHASE-T, but only 32.6% QM and 9.3% IM on CHASE-C. This shows that question sequences created from scratch with our method are much more challenging, which is consistent with our analysis in Section 4.
Third, the performance of the SOTA approaches on CHASE-T is lower than that on SParC. There are two reasons for the degradation. First, during the construction of CHASE-T, we try our best to mitigate the two biases found in Section 2, which makes CHASE-T more pragmatic and challenging than SParC. Second, existing approaches for XDTS are tuned for English only, and some components of these approaches cannot process Chinese inputs as well as English inputs. Finally, although RAT-CON achieves the SOTA performance on SParC and CoSQL, it lags behind EditSQL and IGSQL by a large margin on CHASE and CHASE-C. Through a careful examination, we find that RAT-SQL (Wang et al., 2020a), the model that RAT-CON builds upon, adopts a string-match based method to find the linking between database schema items and their mentions in questions. However, this string-match based method struggles when many schema items are not exactly mentioned in questions. Also, this method struggles on Chinese, probably because it is only tuned for English. The annotations of schema linking in CHASE can provide a great opportunity for future research to tackle this problem. Table 10 shows the QM of IGSQL on the development set of CHASE, stratified by contextual dependency, SQL hardness, and question position. We can observe a remarkable discrepancy between the QM on context-independent and context-dependent questions. To tackle this problem, more advanced context modeling methods are needed. Our annotations of contextual dependency in CHASE can enable a fine-grained analysis of XDTS approaches, and they potentially can be used to address this problem. Besides, we observe that the QM of IGSQL on medium, hard, and extra hard queries of CHASE is higher than that of CHASE-C and CHASE-T, implying that more training samples for these complex queries can improve an approach's performance on them. A similar observation can be made for question position: the QM of IGSQL on questions in turn 4 and >=5 is higher than that of CHASE-C and CHASE-T. Table 11 shows the predictions of IGSQL for the question sequence shown in Figure 1.

Table 11: Predictions $\hat{y}_j$ of IGSQL for the question sequence in Figure 1. SQL queries are translated to English.
$q_1$: 哪所大学培养了最多MVP球员？(Which university has the most MVP players?)
  gold $y_1$: select t2.college from MVP_Record as t1 join player as t2 group by t2.college order by count(distinct t2.player_id) desc limit 1
  predicted $\hat{y}_1$: select college from player group by college order by count(*) desc limit 1
$q_2$: 状元呢？(How about the first overall pick?)
  gold $y_2$: select college from player where is_first_pick = "yes" group by college order by count(*) desc limit 1
  predicted $\hat{y}_2$: select is_first_pick from player group by college order by count(*) desc limit 1
$q_3$: 居然还是肯塔基！杜克也非常出名啊，它培养了多少呢？(Still Kentucky! Duke is also very famous! How many does it have?)
  gold $y_3$: select count(*) from player where is_first_pick = "yes" and college like "%duke%"
  predicted $\hat{y}_3$: select count(*) from player where college like "%duke%"
$q_1$ queries the players who have won MVP, but IGSQL misses the "MVP Record" table, probably because the FROM clause of the SQL query is synthesized based on the other predicted clauses. $q_2$ requires a resolution of ellipsis: it queries the college with the most first pick players, but IGSQL fails to resolve the ellipsis and predicts the wrong column in the SELECT clause. The last question omits the object "first pick players" of the verb "have", but the approach cannot fully resolve it and misses the first pick constraint in the WHERE clause.

Related Work
Dataset XDTS is a sub-task of context-dependent semantic parsing (CDSP) (Suhr et al., 2018; Guo et al., 2019a; Li et al., 2020). Many datasets have been constructed for CDSP. They can be categorized into two groups according to their annotations.
(1) Denotation Utterances in this group of datasets are only labelled with their denotations, i.e., the execution results of logical forms. SEQUENTIALQA (Iyyer et al., 2017), SCONE (Long et al., 2016), and CSQA (Saha et al., 2018) are representative datasets in this group. SEQUENTIALQA was constructed by decomposing some complicated questions from WikiTableQuestions (Pasupat and Liang, 2015) into sequences of simple questions. A question sequence in SCONE was collected by randomly generating a sequence of world states and asking annotators to write an utterance between each pair of successive states. CSQA was constructed by collecting a large number of individual questions and converting them into question sequences via a set of manually crafted templates.
(2) Logical Form Utterances in this group are labelled with their logical forms. Besides SParC and CoSQL, ATIS (Hemphill et al., 1990; Dahl et al., 1994) and TEMPSTRUCTURE (Chen and Bunescu, 2019) also fall into this group. ATIS was constructed under the Wizard-of-Oz (WOZ) setup. An annotator raised a question, and another annotator wrote the corresponding SQL query. Unlike datasets for XDTS, ATIS only focuses on the flight planning domain, which limits the SQL logic it can contain. TEMPSTRUCTURE was also constructed under the WOZ setup, but it synthesized many artificial question sequences with templates to enlarge the dataset.
CHASE belongs to the logical form group. To the best of our knowledge, it is the largest dataset with logical forms annotated for CDSP. Also, CHASE is the first Chinese dataset for CDSP.
Approach A lot of approaches have been proposed to address XDTS (Zhang et al., 2019; Cai and Wan, 2020; Zhong et al., 2020; Hui et al., 2021; Yu et al., 2021). Zhang et al. (2019) proposed EditSQL, which generates a SQL query by editing the query generated for previous turns. EditSQL also uses an interaction-level encoder (Suhr et al., 2018) to model the interactions between the current question and previous questions. IGSQL (Cai and Wan, 2020) improves over EditSQL by introducing a graph encoder to model database schema items together with historically mentioned items. Hui et al. (2021) jointly modeled a question sequence, schema items, and their interactions via a dynamic graph and a graph encoder. They also proposed a reranking module to improve the generation accuracy. Liu et al. (2020) systematically compared different context modeling methods on SParC and CoSQL. They found that concatenating all questions as inputs rivals or even outperforms more complicated context modeling methods. This finding also motivates us to implement the strong benchmark approach, RAT-CON.

Conclusion and Future Work
This work presents CHASE, to date the largest dataset for XDTS, consisting of 5,459 question sequences over 280 databases. Each question in CHASE has rich semantic annotations, including its SQL query, contextual dependency, and schema linking. Experimental results show that CHASE highlights the challenging problems of XDTS and that there is still a long way to go to meet the real Text-to-SQL demands of users. Currently, CHASE is constructed for Chinese. We plan to support more languages in the future. Besides, we plan to explore ways to utilize the rich semantic annotations in CHASE to address the challenges in XDTS.

Ethical Considerations
This work presents CHASE, a free and open dataset for the research community to study the cross-database context-dependent Text-to-SQL problem (XDTS). Data in CHASE are collected from two sources. First, we collect 120 databases from the DuSQL (Wang et al., 2020c) dataset, a free and open dataset for the Chinese Text-to-SQL problem. To collect question sequences on these 120 databases, we recruit 12 Chinese college students (5 females and 7 males). Each student is paid 10 yuan ($1.6 USD) for creating each question sequence. This compensation is determined according to prior work on similar dataset construction (Yu et al., 2019a). Since all question sequences are collected against open-access databases, there is no privacy issue. Second, to enlarge our dataset, we translate all the data, including questions, SQL queries, and databases, in SParC (Yu et al., 2019a) from English to Chinese. SParC is a free and open English dataset for XDTS. 11 college students (5 females and 6 males) are recruited to perform the translation, each of whom is paid 2 yuan ($0.3 USD) for translating each question. The details of our data collection and its characteristics are introduced in Sections 3 and 4.

A Appendix
A.1 RAT-CON

RAT-CON is our extension of RAT-SQL (Wang et al., 2020a), the state-of-the-art approach for the context-independent Text-to-SQL problem. Given a question q and a database DB, RAT-SQL first links the database schema items with their mentions in questions via a string-match-based method. Then, the linking results are jointly encoded with q and DB using a relation-aware self-attention transformer (Shaw et al., 2018). To generate a SQL query y, RAT-SQL adopts a grammar-based decoder (Yin and Neubig, 2017).
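RAT-SQL's string-match-based schema linking can be illustrated with a minimal sketch. The matching rules below (exact and partial n-gram overlap between question tokens and schema item names) are a simplification of the original method, and the function name `link_schema` is our own:

```python
def link_schema(question_tokens, schema_items):
    """Toy string-match schema linking: for each schema item (table or
    column name), find question n-grams that exactly or partially match
    it. Returns a list of (schema_item, (start, end), match_type)."""
    links = []
    n = len(question_tokens)
    for item in schema_items:
        item_text = item.lower().replace("_", " ")
        for i in range(n):
            for j in range(i + 1, n + 1):
                span = " ".join(question_tokens[i:j]).lower()
                if span == item_text:
                    links.append((item, (i, j), "exact"))
                elif len(span) > 2 and span in item_text:
                    links.append((item, (i, j), "partial"))
    return links
```

For instance, for the question "what is the student name" and the column `student_name`, the span "student name" is an exact match, while "student" alone is a partial match.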
To extend RAT-SQL to the context-dependent setting, we use the simple concatenation context modeling method, which has been shown to be competitive with other more complex context modeling methods (Liu et al., 2020). Specifically, to generate the SQL query y_j^i for q_j^i, we concatenate all its prior questions q_1^i, · · ·, q_{j−1}^i with a special separator symbol. The other components of RAT-SQL remain the same. Figure 2 shows the architecture of RAT-CON with an illustrative example.
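The concatenation context modeling can be sketched as follows. The separator token `[SEP]` here is a placeholder of our own choosing, not necessarily the symbol used in RAT-CON:

```python
def build_input(questions, sep="[SEP]"):
    """Concatenate all prior questions with the current one, joined by a
    separator symbol, to form the encoder input for the current turn.
    `questions` holds q_1 .. q_j of the current interaction, in order."""
    return f" {sep} ".join(questions)
```

For example, `build_input(["Who are the first pick players?", "Which teams do they play for?"])` yields a single string in which the current question follows its context, separated by `[SEP]`.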
We implement RAT-CON on the codebase of DuoRAT (Scholak et al., 2021). We use the default hyper-parameters in DuoRAT except for the batch size, which is altered to 24.
A.2 Annotation Tool for CHASE-C

Figure 3 shows the user interface of our annotation tool for collecting question sequences in CHASE-C. When an annotator is going to raise a follow-up question, an intent category is randomly sampled from one of the five categories in Table 4. The chosen category is highlighted in the row "意图 (intent)" of the left panel. The annotator is recommended to raise a question that meets one of the relations in the category. After raising the question, the annotator is asked to label the contextual dependency and the corresponding SQL query of the question. The SQL query should be executable in the SQLite database engine. The execution results are shown to the annotator. Besides, we extract all the tables, columns, and values in the query, and we ask the annotator to link them to their mentions in the question. The linked characters are highlighted in the row "Tokens" of the left panel.

A.3 Experimental Details
To study existing datasets for XDTS, we need to obtain the predictions of the benchmark approaches on the development sets. The predictions of EditSQL are released with its source code, so we directly use them for analysis. As for IGSQL, we train it with the default hyper-parameters specified in its source code, but we cannot reproduce the numbers reported in its paper. Nevertheless, IGSQL still outperforms EditSQL on both SParC and CoSQL. In terms of RAT-CON, we train it from scratch. All our experiments were conducted on a TITAN RTX GPU with 24 GB memory.

R1   Add Property        select name from student;  →  select name, age from student;
R2   Remove Property     select name, age from student;  →  select name from student;
R3   Replace Property    select name from student;  →  select country from student;
R4   Add Group           select count(*) from student;  →  select country, count(*) from student group by country;
R5   Add Aggregation     select name from student;  →  select count(*) from student;
R6   Alter Aggregation   select max(age) from student;  →  select avg(age) from student;
R7   Delete Aggregation  select count(*) from student;  →  select name from student;

Different Instances of the Same Entity
R8   Subset      select name from student;  →  select name from student where country = "US";
R9   Superset    select name from student where country = "US";  →  select name from student where country = "US" or country = "China";
R10  Disjoint    select name from student where country = "US";  →  select name from student where country = "China";
R11  Complement  select name from student where country = "US";  →  select name from student where country != "US";
R12  Overlap     select name from student join student_course where course_name = "Python";  →  select name from student join student_course where course_name = "C++";

Different Entity
R13  Change Entity  select name from student;  →  select course_name from course;

Display
R14  Add Order    select country, count(*) from student group by country;  →  select country, count(*) from student group by country order by count(*);
R15  Alter Order  select country, count(*) from student group by country order by count(*) asc;  →  select country, count(*) from student group by country order by count(*) desc;
R16  Distinct     select country from student;  →  select distinct country from student;

q_1^3  What unique cities are in Asian countries?  (Independent)
y_1^3  select distinct t3.name from country as t1 join countrylanguage as t2 join city as t3 where t1.continent = "Asia"
q_2^3  Which of those cities have a population over 200,000?  (Dependent: Coreference)
y_2^3  select distinct t3.name from country as t1 join countrylanguage as t2 join city as t3 where t1.continent = "Asia" and t3.population > 200000
q_3^3  What is the average population of all cities in China?  (Independent)
y_3^3  select avg(t3.population) from country as t1 join countrylanguage as t2 join city as t3 where t1.name = "China"
q_4^3  What is the average population of all cities that speak the Dutch language?  (Independent)
y_4^3  select avg(t3.population) from country as t1 join countrylanguage as t2 join city as t3 where t2.language = "Dutch"

Table 15: Question sequence examples in CoSQL. Since CoSQL is a task-oriented dialogue corpus, it has some questions involving clarification, e.g., the second question q 2
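Some of these relations can be verified directly by executing both queries. The sketch below checks the R8 "Subset" relation on an in-memory SQLite database with a toy `student` table; the table contents are our own illustrative data:

```python
import sqlite3

# Toy database mirroring the `student` table used in the relation examples.
conn = sqlite3.connect(":memory:")
conn.execute("create table student (name text, country text)")
conn.executemany("insert into student values (?, ?)",
                 [("Alice", "US"), ("Bob", "China"), ("Carol", "US")])

prev_sql = "select name from student"                       # previous turn
curr_sql = "select name from student where country = 'US'"  # R8 follow-up

prev_rows = set(conn.execute(prev_sql).fetchall())
curr_rows = set(conn.execute(curr_sql).fetchall())

# Under R8 (Subset), the follow-up's result set is contained in the
# previous turn's result set.
assert curr_rows <= prev_rows
print(sorted(r[0] for r in curr_rows))  # ['Alice', 'Carol']
```

Executing query pairs this way is only a sanity check on specific database contents, not a proof of the relation in general, but it matches how our annotation tool shows execution results to annotators.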