Database reasoning over text

Neural models have shown impressive performance gains in answering queries from natural language text. However, existing works are unable to support database queries, such as “List/Count all female athletes who were born in 20th century”, which require reasoning over sets of relevant facts with operations such as join, filtering and aggregation. We show that while state-of-the-art transformer models perform very well for small databases, they exhibit limitations in processing noisy data, numerical operations, and queries that aggregate facts. We propose a modular architecture to answer these database-style queries over multiple spans from text and aggregating these at scale. We evaluate the architecture using WikiNLDB, a novel dataset for exploring such queries. Our architecture scales to databases containing thousands of facts whereas contemporary models are limited by how many facts can be encoded. In direct comparison on small databases, our approach increases overall answer accuracy from 85% to 90%. On larger databases, our approach retains its accuracy whereas transformer baselines could not encode the context.


Introduction
Question answering (QA) over text has made significant strides in recent years owing to the availability of new datasets and models. Machines have surpassed human performance on the well-known SQuAD task (Rajpurkar et al., 2016) where models extract answer spans from a short passage of text. The subsequent body of work has further considered incorporating retrieval from large corpora such as Wikipedia (Dhingra et al., 2017; Joshi et al., 2017; Kwiatkowski et al., 2019) to identify relevant information, conditioning answer generation (Chen et al., 2017; Lewis et al., 2020b; Izacard and Grave, 2020). More sophisticated architectures have been proposed with incremental retrieval for multi-hop QA (Das et al., 2019), where several passages are required, which may have low lexical or semantic similarity with the question. This paper considers the problem of answering questions similar to database queries, such as those shown in Figure 1. For example, the query "List all the female athletes in Wikipedia who were born in the 20th century" requires reasoning over hundreds or thousands of facts, retrieved from multiple Wikipedia pages, and applying set-based filters to them (e.g., gender, birth date). If our query further asked how many such athletes exist, we would have to perform an aggregation function to count the result set. The ability to answer the aforementioned queries would enable a new kind of database (Thorne et al., 2021) where facts can be described in natural language and would therefore obviate the need for a pre-defined schema, which is a major limitation of current database systems. An example application for such flexible text databases exists in the area of storing knowledge for personal assistants where users store data about their habits and experiences, their friends and their preferences, for which designing a schema is impractical.
1 https://github.com/facebookresearch/NeuralDB
We introduce WIKINLDB, a benchmark dataset for exploring database reasoning over facts expressed in natural language. WIKINLDB contains a number of query types that require systems to return large set-based answers and aggregate over these (with operators such as count, min, and max). Our dataset is generated using publicly available knowledge graph data, enabling large volumes of instances to be generated with minimal effort. Most queries in WIKINLDB require reasoning over hundreds of facts to generate answers, exposing limitations in current neural models. In contrast to DROP (Dua et al., 2019) where queries are answered over single passages, and bAbI (Weston et al., 2015), where each query is based on a context of less than 20 facts, our dataset scales from databases of 25 instances to 1000, and could be extended further.
We also introduce a modular architecture to support database reasoning over text and characterize its behavior on our reference dataset. We find that even on small databases of 25 facts, naive application of transformers is insufficient. When provided with only the relevant facts, the baseline yields an answer accuracy of 85%, whereas applying our proposed architecture yields 90% by better answering queries, such as count, that require computation. It is well known that transformer models do not scale well to large inputs due to the use of self-attention. We found that mechanisms such as Fusion in Decoder (Izacard and Grave, 2020, FiD) and LongFormer (Beltagy et al., 2020), which mitigate the scaling issue, harm the model: combining more than 2 facts with FiD resulted in answer accuracies of 76% and 39%, respectively. These issues were mitigated by our approach, which generates intermediate query-based derivations of small numbers of facts in the database before using conventional computation to aggregate the results.
Answering Database Queries over Text

Problem Definition
We refer to corpora that consist of unordered collections of facts expressed as short natural language sentences as Natural Language Databases (NLDBs). For example, a corpus may include all the utterances given to a personal assistant by its user, or all the claims uttered by a political figure. The texts in our corpora are similar to databases as they are sets of stand-alone facts. But unlike a database, they are not expressed as rows or triples in a pre-defined schema. For example, a sentence may contain a single fact, such as "Gustavo likes espresso", or multiple facts, such as "Robertson Howard, who attended the University of Virginia, is buried in the Congressional Cemetery".
A query $Q$ over a database $D$ produces a set of answers: $Q(D) = \{a_1, \ldots, a_l\}$. We consider the following four query types (see examples in Table 5):
(1) Set queries are extractive queries that return a list of spans, such as entities, from the facts.
(2) Boolean queries return a True/False answer.
(3) Aggregation queries require computation over answer sets with an operator, such as count, min and max. For example: "How many people work for Yale Law School?"
(4) Join queries require the combination of two (or more) facts to produce each answer. We combine join operations with set, Boolean and aggregation queries. For example, the query "Who works in a company in France?" considers both the relationship between people and employer as well as company locations.

Challenges
The NLP treatment of question answering, where systems encode the query and context (containing the background knowledge), forms a good starting point for NLDBs. Common model architectures are based on the transformer (Vaswani et al., 2017) in an encoder-decoder configuration. The encoder uses self-attention to conditionally encode the context with the query and the decoder allows conditional generation of outputs that are not necessarily present in the input. To scale question answering to reason over large knowledge sources such as Wikipedia, task formulations typically retrieve text spans from a corpus to condition answer generation (Chen et al., 2017; Dhingra et al., 2017). However, several challenges encountered in NLDBs preclude direct application of these techniques:
Figure 2: Overview of the proposed architecture, consisting of a support set generator, SPJ and aggregation.
Scale To scale neural reasoning to databases of non-trivial size, it would not be feasible to encode the entire database as input to the transformer. Question answering systems combine a retrieval mechanism to select relevant spans from knowledge sources as context. This task is usually referred to as open-domain QA (Lewis et al., 2020a; Izacard and Grave, 2020). It is common to use a maximum input size of 512 or 1024 tokens for context. While extensions such as Linformer, Longformer (Beltagy et al., 2020) and Fusion in Decoder (Izacard and Grave, 2020) enable larger contexts to be encoded, their application of self-attention varies and the number of tokens that may be encoded is limited by GPU memory.
Multiple answer spans The NLP formulation of question answering typically requires extracting a span from a single document or generating a short answer. Answering queries in a NLDB may require processing a large number of facts, generating a large number of items as answer, hundreds or thousands, and performing aggregations over large sets.
Locality and document structure NLDBs do not enjoy the locality properties that usually hold in open-domain QA. In NLDBs, a query may be dependent on multiple facts that can be anywhere in the database. In fact, by definition, the current facts in a database can be reordered and the query answers should not change. In contrast, in open-domain QA, the fact needed to answer a given question is typically located in a paragraph or document with multiple sentences about the same subject, in combination with a document title, where this additional context may help information recall.

Conditional retrieval
Similar to open-domain question answering, NLDBs mandate an information retrieval component. When determining which facts to input to the model, NLDBs may require conditional retrieval from the database. For example, to answer the query "Whose spouse is a doctor?" we'd first need to fetch spouses and then their professions. Recent work on multi-hop query answering (e.g., Asai et al. (2019)), has started considering this issue but is restricted to the case where we're looking for a single answer. In NLDBs, we may need to perform multi-hops for sets of facts.

Architecture for querying NLDBs
To address the aforementioned challenges, we propose an instance of a Neural Database architecture (Thorne et al., 2021) that operates over textual facts with parallelizable non-blocking operators before aggregating the results. The three core components of the architecture, shown in Figure 2, are a Support Set Generator (SSG) which retrieves small sets of relevant facts called support sets, a parallelizable non-blocking Select-Project-Join (SPJ) operator which generates intermediate answers that can be unioned to produce the final answer, and an optional aggregation stage which uses conventional computation to perform numerical reasoning. The key insight underlying our architecture is to leverage neural models for what they excel at, namely, reasoning over a small set of facts.
Neural SPJ Operator Given a single support set and a query, the SPJ (Select-Project-Join) operator outputs a machine readable intermediate representation of the answer that can be generated from the support set. For example, given the query "Who was born in Montevideo?" and the support set {"Mario Sagario was born in Montevideo, Uruguay, ..."}, the Neural SPJ would output the entity literal Mario Sagario. Examples of outputs are provided in Figure 3. The SPJ operator is performing three functions: (1) for support sets that are insufficient to answer a question, the operator should return no output; (2) for queries that require short chains of reasoning over multiple facts, the SPJ operator joins the facts when generating the output; and (3) the SPJ generates a projection of the support set to a machine readable format dependent on the given query, and whether computation or aggregation is required.
Because the SPJ operator is run in parallel, it can scale independently of the limitations on the size of the input of a single transformer. In contrast, the use of self-attention when encoding all facts as one input precludes parallelization, has high latency, and is limited by the memory required to compute the self-attention. By using the SPJ operator to perform query-dependent information extraction, aggregations can be performed over the generated outputs using conventional computation, which trivially scales to thousands of operands. Furthermore, this allows large result sets to be generated by the model, whereas accurately decoding long sequences using an encoder-decoder architecture remains an open challenge (Hupkes et al., 2020).
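The independence of support sets can be illustrated with a minimal sketch. It assumes a hypothetical callable `spj(query, support_set)` that returns an intermediate answer or `None`; here a string-matching stand-in (`toy_spj`) takes the place of the neural model, and the thread pool is only one possible way to run the calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_spj(query, support_sets, spj, max_workers=4):
    """Apply the SPJ callable to every support set independently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the support sets.
        outputs = list(pool.map(lambda s: spj(query, s), support_sets))
    # Support sets that cannot answer the query yield no output.
    return [o for o in outputs if o is not None]


def toy_spj(query, support_set):
    """Stand-in for the neural SPJ (assumption: simple string matching)."""
    for fact in support_set:
        if "born in Montevideo" in fact:
            return fact.split(" was born")[0]
    return None


support_sets = [
    ["Mario Sagario was born in Montevideo, Uruguay"],
    ["Gustavo likes espresso"],
]
answers = run_spj("Who was born in Montevideo?", support_sets, toy_spj)
print(answers)
```

Because no call depends on another, the number of support sets can grow without increasing the input size of any single model call.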

Support Set Generator (SSG)
A support set contains the minimal subset of sentences from the database needed to generate one single operand for the aggregation module by the SPJ operator. For example, for queries that are answered by a single sentence, e.g., "Who is Sheryl's husband?", the support set containing a single fact should be returned, e.g., {"Sheryl is Nicholas's spouse"}. The output of the support set generator is a set of support sets, each of which is fed independently to a downstream SPJ module. Support sets may not be pairwise disjoint because some facts may be required for multiple answers.
The SSG output should satisfy the following two properties: (1) If multiple facts are needed to produce an intermediate answer, they should all be in the support set. For example, if we queried "When was Sheryl's husband born?", the support set should include a fact stating who the spouse is and a fact describing when they were born. (2) When performing aggregation, or outputting a set of answers, multiple support sets must be generated, each containing enough information to generate the intermediate results that are aggregated. For example, for the query "Who is the oldest person?", each of the support sets would independently contain a fact that includes a person and indicates their age.
Aggregation The outputs of the SPJ modules are intermediate answers to the query. For some queries, e.g., "who lives in London?", the final answer is simply the union of the intermediate answers. In other cases, e.g., "how many countries grow coffee?", an aggregation operator needs to be applied to the union of intermediate answers.
Because the outputs of the SPJ operators are machine readable, we can guarantee accuracy and scalability by performing aggregation using conventional computation. In this paper, we consider the aggregation functions min, max and count.
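As a rough sketch of this stage, the function below aggregates intermediate answers with ordinary Python. The operator names, the use of `None` for "no output", and the (key, value) tuple format for min/max are assumptions made for illustration; the example values are made up.

```python
def aggregate(operator, derivations):
    """Combine SPJ outputs using conventional computation.

    derivations: list of SPJ outputs; None means "no output".
    For "min"/"max" each output is assumed to be a (key, value) pair.
    """
    results = [d for d in derivations if d is not None]
    if operator == "set":
        return set(results)            # union of intermediate answers
    if operator == "count":
        return len(set(results))       # size of the union
    if operator in ("min", "max"):
        if not results:
            return None
        pick = min if operator == "min" else max
        key, _value = pick(results, key=lambda kv: kv[1])
        return key                     # argmin / argmax over the values
    raise ValueError(f"unknown operator: {operator}")


# Illustrative, made-up intermediate answers.
countries = ["Brazil", "Colombia", None, "Brazil"]
visitors = [("Museum A", 120), ("Museum B", 300)]
print(aggregate("count", countries))
print(aggregate("max", visitors))
```

Duplicates are collapsed before counting because the same answer may be derived from more than one support set.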

The WIKINLDB dataset
In this section we introduce WIKINLDB, a novel dataset for training NLDBs which is generated by transforming structured data from Wikidata (Vrandečić and Krötzsch, 2014) into natural language facts and queries. Wikidata stores triples of the form (S,R,O), where R is a relationship between the subject S and the object O, e.g., (Tim Cook, employedBy, Apple). The scale and breadth of Wikidata enables us to generate databases of many sizes and variety.
Facts To automate generation of questions and answers, sentences must be grounded in Wikidata identifiers. One approach to generate facts would be to use templates or collect them through grounded information extraction datasets such as T-REx (Elsahar et al., 2018). However, to ensure wider linguistic variety as well as accuracy of the mapping, we use verbalizations of knowledge graph triples that are synthesized through a sequence-to-sequence model. Concretely, we use generated sentences from KELM (Agarwal et al., 2020), which are not grounded with Wikidata IDs, and generate a post-hoc mapping back to Wikidata. For example, given the sentence: "The Slice of Life manga series The Film Lives On was written by Osamu Tezuka." we map it to the Wikidata triple (Q11332517,P50,Q193300). Our mapping is a two-step process: first, we look up entity names from Wikipedia, returning multiple matches for Osamu Tezuka, and second, we filter these based on which have an author relation to The Slice of Life in the Wikidata graph. While out of scope for this paper, this technique could be applied to generate training datasets for novel domains. WIKINLDB uses both atomic facts in KELM (about a single relation of an entity) and composite facts (about multiple relations).
Queries Following previous work on large-scale question answering (Hartmann et al., 2018; Talmor and Berant, 2018), queries are generated using templates. For each relation and operator, multiple templates were written by the authors where placeholders can be replaced with the subject and objects for each relation. While multiple templates are used to ensure variety, these are limited in diversity in comparison to the facts. Templates were generated for the first 25 relations on Wikidata with mapped data in KELM. To generate queries that require joins we apply the same technique, combining two or more connected relations and chaining the entities. We further select the 15 most popular relations and generate additional templates which chain the two relations. For example, we chain (Y,locatedIn,Z) and (X,employedBy,Y) to create a template for the query "Does $X work at a company based in $Z?".
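The chaining of two relations can be sketched as follows. This is an illustration rather than the dataset's generation code: the helper names are hypothetical, the triples are reduced to pairs, and the location fact is made up for the example. Python's `string.Template` happens to use the same `$X` placeholder syntax as the template quoted above.

```python
from string import Template


def chain(employed_by, located_in):
    """Join (X, employedBy, Y) with (Y, locatedIn, Z) on the shared entity Y."""
    locations = {}
    for y, z in located_in:
        locations.setdefault(y, []).append(z)
    return [(x, y, z) for x, y in employed_by for z in locations.get(y, [])]


def instantiate(template, **bindings):
    """Fill the $-placeholders of a query template."""
    return Template(template).substitute(bindings)


employed_by = [("Tim Cook", "Apple")]      # (X, Y) pairs
located_in = [("Apple", "Cupertino")]      # (Y, Z) pairs, illustrative only

template = "Does $X work at a company based in $Z?"
queries = [instantiate(template, X=x, Z=z)
           for x, _y, z in chain(employed_by, located_in)]
print(queries)
```

The intermediate entity Y is dropped from the query text, so answering it requires joining two facts.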

Data Quality
We manually inspect randomly selected queries and facts and score them using the categories introduced in this section. For queries, we sample 70 instances, 10 for each query type. We score each query for fluency and intelligibility. Out of 70 queries, only one question was marked as non-fluent due to a typo which was corrected for the final dataset. All 70 queries were intelligible. We observed that the clarity of some queries depended on the facts in the database to provide context (e.g. "Who is male?"), but otherwise met the task requirements.
To assess the quality of mapped facts from KELM, a sample of 50 was evaluated based on 6 categories: intelligibility, fluency, inclusivity (conveying information from all the mapped relations), faithfulness to these relations, and whether extraneous information (not in the mapped relations) is present. 49/50 facts were intelligible and 45/50 facts were fluent. The remaining 5 had redundant information or missing conjunctions. 50/50 facts contained all mapped relations and 48/50 were faithful to these relations. 8/50 facts had extraneous information for relations that could not be mapped. The relations that could not be mapped are not used for query generation and did not affect how answers were automatically generated.

WIKINLDB Statistics
We create databases over 25 common relationships from Wikidata, and create 643 templates from which queries are phrased. For join-type queries, we chain a further 15 relations with a further 86 template fragments. The relations we chose were selected from a weighted sample of the most common entity types in KELM. In total, we generate five variants of the dataset containing databases of size 25 to 1000 facts where each fact has between 30 and 50 tokens. Dataset statistics are reported in Table 1.

Table 1: Dataset statistics (one row per database size; the first column is the number of facts per database).
25    8   4000  631  621
50    7   4986  498  499
100   13  2500  250  250
250   53  1000  100  100
500   66  500   50   50
1000  70  250   25   25

Neural Select-Project-Join
The SPJ operator is trained as a sequence-to-sequence model to generate intermediate results from a support set and a given query. All facts in the support set are concatenated with the query before being input to a transformer model. The model is trained to output different derivations depending on the query type. For the min and max operators, the projection is a machine-readable key-value pair, illustrated in Figure 3. For example, "which place has the highest yearly number of visitors?" has a projection of the form (place, number of visitors), allowing an argmax operation by the downstream aggregation module. For queries with Boolean answers, the output is a token indicating whether the answer is true or false. For all other queries, where a set of results is returned or counted, the output is simply a span, such as an entity or numerical value, extracted from the support set.
Even though we use intermediary annotation for the SPJ operator, we believe that collecting such annotation is a simpler labeling task compared to collecting the answers to the queries. For example, given the fact "Serena Jameka Williams (born September 26, 1981) is an American professional tennis player and former world No." and the query "List all the female athletes who were born in 20th century.", it seems relatively simple to provide the label "Serena Jameka Williams". However, it is non-trivial to produce a list of potentially hundreds of entities as answer (e.g. ["Serena Jameka Williams, Simona Halep, Mary Lou Retton, Megan Rapinoe, Kim Simmone, Mary Abichi, ..."]). The training of the components in our proposed architecture does not depend on the final answer but instead on the simpler intermediary labels.
Predicting Aggregation Operator Rather than using a separate classifier to predict the question type, we encode the choice of operator as a special token that is predicted by the SPJ operator and prepended to the model output (Figure 3). The aggregation operator is chosen using a majority vote over all generated derivations from all support sets.
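The majority vote can be sketched in a few lines. The bracketed token format (for instance "[COUNT] Brazil") is an assumption made for this example and is not taken from the released model; only the voting step follows the description above.

```python
from collections import Counter


def split_derivation(output):
    """Split 'TOKEN payload' into (operator token, payload)."""
    operator, _, payload = output.partition(" ")
    return operator, payload


def choose_operator(outputs):
    """Pick the operator token predicted for most derivations."""
    votes = Counter(split_derivation(o)[0] for o in outputs)
    return votes.most_common(1)[0][0]


# Hypothetical SPJ outputs for one query, one per support set.
outputs = ["[COUNT] Brazil", "[COUNT] Peru", "[SET] Kenya"]
operator = choose_operator(outputs)
payloads = [split_derivation(o)[1] for o in outputs]
print(operator, payloads)
```

Voting across support sets means a single mispredicted token does not change the aggregation that is applied.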
Negative Example Generation It is important for the SPJ to be resilient to extraneous facts that might be returned by a low-precision high-recall SSG. Negative instances for training are generated in two ways: (1) queries are paired with randomly sampled facts and the model is trained to generate a NULL projection (indicating the support set does not contribute to the answer). For example, a fact about someone's date of birth isn't useful when answering a query about the visitor count of an attraction. (2) for a portion of the training instances, we additionally sample extraneous unrelated facts and append these to the support sets simulating false-positive facts from the SSG.
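Both kinds of negative instance can be sketched as below. The dictionary layout, the `NULL` string and the function names are hypothetical; the sketch also assumes that the caller passes in every fact relevant to the query, so that the sampled facts are genuinely unrelated.

```python
import random

NULL = "NULL"  # assumed marker for "support set does not contribute"


def null_negative(query, relevant_facts, database, rng):
    """(1) Pair the query with a random unrelated fact; the target is NULL."""
    pool = [f for f in database if f not in relevant_facts]
    return {"query": query, "facts": [rng.choice(pool)], "target": NULL}


def noisy_positive(query, support_set, target, relevant_facts, database, rng,
                   n_extra=1):
    """(2) Append unrelated facts to a true support set; the target is unchanged."""
    pool = [f for f in database if f not in relevant_facts]
    extras = rng.sample(pool, min(n_extra, len(pool)))
    return {"query": query, "facts": list(support_set) + extras, "target": target}


database = [
    "Sheryl is Nicholas's spouse",
    "Gustavo likes espresso",
    "Mario Sagario was born in Montevideo, Uruguay",
]
query = "Who is Sheryl's husband?"
relevant = ["Sheryl is Nicholas's spouse"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

negative = null_negative(query, relevant, database, rng)
positive = noisy_positive(query, relevant, "Nicholas", relevant, database, rng)
print(negative)
print(positive)
```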

Support Set Generator
For simple queries over single facts, conventional information retrieval, such as TF-IDF, could be considered a primitive SSG. However, this would not scale for joins, aggregation queries or for queries outputting a set of answers, as generating relevant sets requires incremental decoding, conditioning on already retrieved facts.
Naively generating the set of all relevant support sets, $\mathrm{SSG}_Q(D) \subset \mathcal{P}(D)$, would be intractable as it is akin to enumerating the powerset. We construct support sets efficiently by taking an incremental approach, starting from the empty set (see Algorithm 1). At each step, the classifier considers the partially generated support set $D_k$ and the query and predicts which candidate facts $u_i \in D$ from the database should be added, or whether to stop the iteration, these choices being modeled as a multilabel classification task. If STOP is predicted, the partial result set $D_k$ is closed (i.e., it forms part of the output); otherwise, for each fact added, a new intermediate (open) support set is generated which is explored in the next iteration. For efficiency, we use a bi-encoder architecture that independently encodes the facts in the database and the state (query and a partial support set) and computes the inner product between the encoded representations to generate a score: $C_U(u_i)^\top C_V(Q, D_k)$. The encoders are pre-trained transformers fine-tuned to yield a high inner product between the state's encodings and relevant facts to be added. At prediction time, the vectors encoding the facts are static and are pre-computed offline. At each step, t, we encode the state using a transformer by concatenating the query tokens and the facts in the partially generated support set $D_k$. The SSG is trained with full supervision of all partial support sets from the dataset and trained to predict which facts to add to the support set using a contrastive loss.
Complexity of SSG The inner loop of Algorithm 1 involves a Maximum Inner Product Search (MIPS) between the encoded state and the encodings of the facts, which is linear in the number of facts. Approximate search, such as FAISS (Johnson et al., 2019), accelerates retrieval to O(log 2 n). If we assume a query needs a maximum of b support sets, and the average size of a support set is m, then the complexity of the SSG algorithm is O(bm log 2 n). Both b and m are bounded by the number of facts in the database n, but in practice we'd expect only one of the factors b or m to be large. However, there is fertile ground for developing methods for indexing (and/or clustering) the facts in the database so that only a few facts need to be considered in each iteration of the inner loop of the algorithm, leading to significant speedups.
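The incremental procedure can be sketched as follows. This is not a reproduction of Algorithm 1: the score threshold, the size cap, the treatment of STOP as a vector scored like a fact, and the hand-written three-dimensional encoders are all assumptions chosen to keep the example self-contained.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def generate_support_sets(query, facts, encode_fact, encode_state, stop_vec,
                          threshold=0.5, max_size=3):
    """Grow support sets one fact at a time, starting from the empty set."""
    fact_vecs = [encode_fact(f) for f in facts]   # static, computed once
    open_sets, closed = [()], set()
    while open_sets:
        next_open = []
        for partial in open_sets:
            state = encode_state(query, [facts[i] for i in partial])
            stop = dot(stop_vec, state) >= threshold
            if partial and (stop or len(partial) >= max_size):
                closed.add(frozenset(partial))    # closed: part of the output
                continue
            for i, vec in enumerate(fact_vecs):
                if i not in partial and dot(vec, state) >= threshold:
                    next_open.append(partial + (i,))  # new open support set
        open_sets = next_open
    return closed


# Toy encoders: dimensions are (about London, about Paris, stop signal).
facts = ["Alice lives in London", "Bob lives in Paris", "Carol lives in London"]
toy_vecs = {facts[0]: [1, 0, 0], facts[1]: [0, 1, 0], facts[2]: [1, 0, 0]}


def encode_fact(fact):
    return toy_vecs[fact]


def encode_state(query, partial_facts):
    # One fact suffices for this query, so any non-empty state signals STOP.
    return [0, 0, 1] if partial_facts else [1, 0, 0]


support_sets = generate_support_sets("Who lives in London?", facts,
                                     encode_fact, encode_state, [0, 0, 1])
print(support_sets)
```

Each support set is returned as a set of fact indices. The exhaustive scoring loop stands in for the inner-product search; in a larger database it would be replaced by an index over the pre-computed fact vectors.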

Baselines
We compare our proposed architecture to transformer-based models that explore the effect of three attention mechanisms representative of the state-of-the-art. Self-attention in transformers captures both inter-fact as well as intra-fact interactions between tokens. However, computing self-attention is quadratic with respect to memory and scaling beyond 1024 tokens is non-trivial. In our baselines, the task formulation is a sequence to sequence model, similar to that used in question answering. All (relevant) facts are encoded with the query and the transformer is trained to predict the answer without using any intermediate representations. We compare full self-attention against independently encoding the facts (in the context of the query) and fusing the embeddings in the decoder (Izacard and Grave, 2020, Fusion in Decoder (FiD)). Because FiD independently encodes contexts, run-time complexity is reduced to be linear with respect to the number of facts at the expense of not having inter-fact attention. We additionally compare to using windowed attention over facts with global attention to the query using Longformer (Beltagy et al., 2020). Inter-fact attention is captured only within the window.

Implementation
We use the HuggingFace (Wolf et al., 2020) transformers library and its implementations of T5 and Longformer. For SSG, we use BERT to generate encodings, which has a comparable architecture to T5. The learning rate for fine-tuning and number of epochs were selected through maximizing the Exact-Match (EM) accuracy on a held-out validation set for the tasks. For each experiment, we train 3 separate models with different seeds and report mean accuracy. The SPJ models are only trained on the small database of 25 facts and applied to larger databases at test time.
For most queries, we measure correctness using Exact Match (EM), which is 1 if the answer string generated by the model is exactly equal to the reference answer and 0 otherwise. This metric is used to score outputs where either a Boolean, null answer, string or numeric answer is expected. When a set of results is returned, we compute the $F_1$ score considering exact matches of set elements. When comparing models and reporting results, we report macro-averages over all instances in the test set. We collectively refer to this as Answer Accuracy.
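A minimal sketch of this scoring scheme is given below. Two details are assumptions rather than statements from the text: an empty prediction against an empty reference set scores 1, and the metric is chosen by whether the reference answer is a Python set.

```python
def exact_match(prediction, reference):
    """1 if the answers are identical, 0 otherwise."""
    return 1.0 if prediction == reference else 0.0


def set_f1(prediction, reference):
    """F1 over exact matches of set elements."""
    prediction, reference = set(prediction), set(reference)
    if not prediction and not reference:
        return 1.0  # assumption: both empty counts as correct
    true_positives = len(prediction & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(prediction)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def answer_accuracy(instances):
    """Macro-average over (prediction, reference) pairs."""
    scores = [
        set_f1(pred, ref) if isinstance(ref, (set, frozenset))
        else exact_match(pred, ref)
        for pred, ref in instances
    ]
    return sum(scores) / len(scores)


instances = [
    ("True", "True"),              # Boolean query, scored with EM
    ({"a", "b"}, {"b", "c"}),      # set query, scored with F1
]
print(answer_accuracy(instances))
```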

Experiments & Results
We first consider the suitability of transformer models over small databases of 25 facts comparing two information retrieval settings: PerfectIR, which is representative of other question answering approaches that combine an information retrieval system to select only the facts needed to answer a query, and WholeDB, where the entire database is encoded by the model, assessing resilience to unrelated information and noise.
The overall scores, in Table 2, indicate that without a retrieval mechanism (i.e., WholeDB), all models were susceptible to distractor facts. Furthermore, encoding all facts in a single model is not a viable solution to answer queries posed to NLDBs as this approach does not accurately answer queries that combine multiple support sets, illustrated in Figure 4, and cannot easily scale to thousands of facts. Using a transformer yields errors when the query requires computation, such as counting, highlighted when comparing rows 1 and 3 of Table 3.
Inter-fact attention Applying FiD, which does not capture inter-fact attention, to scale to larger databases would not be successful because answer accuracy further decreases with support set size. Applying Longformer, which captures inter-fact attention within a window, could yield outcomes similar to the T5 transformer baseline where relevant facts are encoded with similar locality. However, in the limit, where context falls between different attention windows, the model could degrade to be similar to FiD.

Evaluating the SSG+SPJ architecture
Our architecture consists of a support set generator (SSG), a select-project-join (SPJ) operator that generates derivations over the support sets and an aggregation function over the results of the SPJ operators. Assuming a perfect SSG, the SPJ accurately answers more queries than the T5 transformer baseline (Table 2) because of the computation within the aggregation function that yields higher scores for min/max and count queries, displayed in Table 3. In combination with SSG, the overall score decreases to 67% due to retrieval errors. However, SSG+SPJ still exceeds the WholeDB baselines. It is tricky to evaluate the SSG in isolation because errors here do not necessarily translate into errors in query answers. For example, the SSG may return a superset of a support set, but the SPJ may still generate the correct answer. Table 4 shows the performance of the SSG for a database of 25 facts. An output is considered an exact match if it is exactly the same as a support set in the reference data and a soft match if it is a superset thereof.
Decoding machine-readable outputs The aggregation operator was selected by predicting a special token decoded by the SPJ. For 1.4% of instances, an incorrect choice of aggregation function was made or the machine-readable outputs from the SPJ could not be parsed.
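The two match criteria can be written down directly. The function name and the string labels are hypothetical, and the sketch assumes that a soft match means a strict superset of some reference support set.

```python
def match_type(predicted, references):
    """Classify one predicted support set against the reference support sets."""
    predicted = frozenset(predicted)
    references = [frozenset(r) for r in references]
    if predicted in references:
        return "exact"
    if any(predicted > ref for ref in references):  # strict superset
        return "soft"
    return "none"


references = [{"fact_1"}, {"fact_2", "fact_3"}]
print(match_type({"fact_1"}, references))
print(match_type({"fact_1", "fact_9"}, references))
print(match_type({"fact_2"}, references))
```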

Scaling to larger databases
We scale the baseline transformers to larger databases using TF-IDF and DPR to retrieve appropriate facts. However, these models are still limited by the encoder size of the transformer. In contrast, the SPJ operates over support sets of 1-2 facts and, in combination with the SSG, can scale to arbitrarily large databases, illustrated in Figure 5. For Boolean queries, the combination of T5 and TF-IDF scored 89%, exceeding the accuracy of the SSG+SPJ. This is because TF-IDF exploits token matching between the query and facts. For larger databases, the retrieval errors resulted in lower answer accuracy. While the SPJ accurately answers most query types with a perfect SSG, the propagation of errors from the SSG resulted in erroneous answers as database size increased.

Related Work
Database queries require reasoning over a large set of relevant and non-redundant facts and performing aggregation. While in-roads have been made to perform discrete reasoning and computation over passages (Dua et al., 2019), with explicit computation (Andor et al., 2019), these use only a single passage rather than requiring aggregation over large numbers of facts from different texts. Multi-hop question answering requires finding supporting evidence in multiple documents (see (Welbl et al., 2018; Talmor and Berant, 2018; Wolfson et al., 2020) for datasets facilitating this research). In answering multi-hop questions, the works decompose the question into simpler sub-questions (Min et al., 2019; Wolfson et al., 2020), or condition each hop on the previously retrieved documents (Asai et al., 2019). While tasks such as ComplexWebQuestions (Talmor and Berant, 2018) and BREAK (Wolfson et al., 2020) focus on complex queries that can be broken down into simpler ones, our focus is on set-based and aggregation queries where the complexity comes from the need to retrieve and process a large number of non-redundant relevant facts. In contrast to the set and count tasks in bAbI (Weston et al., 2015), where each query is based on a small context (fewer than 20 facts), our dataset scales from databases of 25 facts to 1000.
Figure 5: Scaling to larger databases with a model trained using 25 facts and tested on larger databases.
Bridging the gap between unstructured natural language data and database-style querying has been a long-standing theme in database research (Halevy et al., 2003). The work on information extraction has developed techniques for translating segments of natural language text into triples that can be further processed by a database system. There has been significant work on translating queries posed in natural language into SQL queries on a database whose schema is known (Androutsopoulos et al., 1995;Li and Jagadish, 2014;Zeng et al., 2020), with extensions to semi-structured data and knowledge bases (Pasupat and Liang, 2015;Berant et al., 2013). More recently, systems such as BREAK (Wolfson et al., 2020) and ShARC (Saeidi et al., 2018) have trained models to translate a natural language query into a sequence of relational operators (or variants thereof).

Conclusions
Database systems are the workhorse of data analysis but they require a pre-defined schema. Part of their power stems from the fact that a data analyst can explore the data by easily posing a wide variety of queries. Given the rise in the amount of data that is becoming available in text, images and other modalities, we would like to build systems that enable the flexibility of posing complex queries against such data, but without the need for a pre-defined schema.
This paper proposed an architecture for neural databases and the associated WIKINLDB dataset, as first steps towards realizing a system for querying multi-modal data. Our architecture is capable of overcoming the limitations of transformer models because it runs multiple transformers in parallel, each taking a small set of facts. Consequently, NLDBs can scale to large databases.
Additional research is required in order to scale NLDBs to larger datasets, more complex queries, and to multi-modal data. In particular, one of the key components of the architecture is the SSG module that retrieves the relevant facts to feed to each instance of the neural SPJ. We believe that in practice, the semantics of the application will provide a strong hint on which facts may be relevant. For example, when querying a large corpus of social-media posts, each post is a candidate support set as long as the query does not require joining data from multiple posts. In addition, we assumed that our databases describe a snapshot of the world. In practice, we may have facts that override previous ones (e.g., 'Samantha works for Apple', followed by 'Samantha works for Twitter') and we would need to reason about which facts should be ignored.

Broader Impact Statement
Ethical Concerns An NL database is very similar to a traditional database in terms of applications, with the difference that it extends the use of databases to unstructured text. For example, NL databases can be used to produce analytics on data expressed in natural language. For an NL database to be applicable in the context of a virtual assistant, it will likely need to be trained on real-world conversations. Privacy-preserving ML methods should be considered for such applications.
Environmental Concerns Large transformer-based models take a lot of computational resources and energy for pre-training and fine-tuning. As a result, such models raise environmental concerns. In our proposed architecture, we only fine-tune transformer models on small support sets. We then use several instances of such models in parallel for inference, instead of a single large model, even on large datasets. Therefore, the model is relatively efficient, both during fine-tuning and during inference.