The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

We introduce the StatCan Dialogue Dataset, consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5,000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.


Introduction
One of the longstanding goals in Natural Language Processing (NLP) is to develop conversational agents that assist people with concrete tasks, such as finding information in large collections of documents or booking restaurants and hotels. To aid the development of such virtual assistants, the research community is in need of benchmarks that reflect the intents and linguistic phenomena found in real-world applications. However, developing such real-world conversational datasets is challenging in the current research landscape. On the one hand, academic labs often struggle to come up with natural use cases of task-oriented dialogue agents and to collect conversations with a large number of real users. Many labs have designed artificial tasks and collected conversations from crowd workers with simulated intents (Budzianowski et al., 2018; Adlakha et al., 2022; Lee et al., 2022), often leading to datasets that do not capture the linguistic challenges of production settings (de Vries et al., 2020). On the other hand, industry labs might have access to users with genuine intents (e.g., through Siri or Alexa) but rarely release such conversational datasets due to their commercial value and user privacy concerns. Hence, we argue that the research community would benefit from a task-oriented dialogue environment where findings can be validated with real users, and, to that effect, present a unique dataset in collaboration with Statistics Canada.

Table 2: Comparison with related datasets (see Section 2). (*) CoQA uses rationales to support extracted answers.

Statistics Canada (StatCan) is a national statistics agency commissioned with collecting key information on Canada's economy, society, and environment. Statistics Canada conducts hundreds of surveys on virtually all aspects of Canadian life and publishes the resulting data tables on statcan.gc.ca. This website currently features 5K+ such complex and often large data tables. Canadian citizens, and other interested individuals, come to this website to find the statistics they are looking for. The StatCan website offers a chat functionality (available in English and French) to help users in case they cannot find the appropriate information.
Sourcing from these live chats, we present the StatCan Dialogue Dataset, a collection of 20K+ English and French conversations between visitors of statcan.gc.ca and agents working at Statistics Canada. Before releasing this dataset, StatCan ran several procedures to remove Personally Identifiable Information (PII). While we observe a wide variety of user intents, ranging from table manipulation to navigation instructions, a large number of visitors use the chat functionality to find data tables on the StatCan website. Specifically, we observe 6.6K instances where an agent returns a link to a data table, across 4.4K conversations. In Table 1, we provide an example conversation in which an online user is looking for specific data tables.
In this work, we develop two novel tasks centered on helping users find specific tables. First, we introduce the table retrieval task, which requires a model to predict the table returned by the agent given the messages sent so far. Second, we introduce the response generation task, which requires a model to predict the agent's response given the dialogue history. For both tasks, we investigate their difficulty by establishing strong baselines and evaluating them on various metrics.
We stress that both tasks are immediately useful in a real-world setting. The table retrieval task can help agents find relevant tables faster while the response generation task may lead to a virtual agent that can return relevant tables through an online conversation. We hope that this tight connection with a real-world scenario will bring the research community more insight into the challenges of developing practical dialogue agents and lead to faster transfer of research ideas and findings.

Related Work
This section presents various directions related to our work. See Table 2 for a comparative summary.
Open-domain QA This is the task of answering questions using a large and diverse collection of text documents. One of the first large-scale evaluations in open-domain QA was presented at TREC-8 (Voorhees, 2001). Since then, many studies have released large-scale open-domain QA datasets: WikiQA (Yang et al., 2015) and MS MARCO (Bajaj et al., 2018) source questions from the Bing search engine, Natural Questions (NQ) (Kwiatkowski et al., 2019) from Google search, and DuReader (He et al., 2018) sources questions in Chinese from Baidu. The questions come from real users, and the answers are collected from the search results through crowd workers. Although those datasets have questions with genuine intent whose answers must be retrieved from a collection of documents, our dataset emphasizes the retrieval of tables (in a conversational setting) rather than free-form documents.

Table 3: Statistics of English conversations and tables in the retrieval and generation tasks. New tables are calculated with respect to the training set (see Table 11).
Chen et al. (2020) and Herzig et al. (2021) respectively propose OTT-QA and TAPAS-NQ, two novel approaches that extend open-domain QA to retrieving tables instead of documents. The former collects both the questions and answers from crowd workers, and the latter extends Natural Questions by using tables from the article where the answer was taken. In both cases, the tables being retrieved are sourced from Wikipedia articles. Although our data also incorporate tabular retrieval, the tables are sourced from statcan.gc.ca, they can be significantly larger (as discussed in Appendix A.2.1), and they are retrieved in an interactive and conversational setting.

Task-oriented Dialogue These systems assist users in accomplishing specific goals, such as booking a restaurant or resolving a customer issue. While early work collected a dataset in a genuine information-seeking setup (Hemphill et al., 1990), many recent datasets have been collected through a simulated setup (Budzianowski et al., 2018; Rastogi et al., 2020; Feng et al., 2020, 2021; Chen et al., 2021; Lee et al., 2022). Task-oriented models usually track the dialogue state by predicting dialogue acts that are specified through intents and slot-value pairs, e.g., findRestaurants(cuisine=Italian). While our dataset does not provide turn-based annotations, the released conversations come with an annotated goal, i.e., which data table the user was looking for. Like other goal-oriented dialogue tasks, this annotation enables us to automatically evaluate dialogue models through a task completion metric.
Chit-chat Dialogue The goal of chit-chat systems is to engage in an open-ended conversation with an end-user (Lowe et al., 2015; Dinan et al., 2018). Unlike our dataset, such conversations do not intend to assist the user with a specific task.

Dataset
User intents The live chat was designed to fulfill specific user intents. The main intent of the chat functionality is to help users find specific data tables. For example, in Table 1, the agent helps the user find tables about building permits, gasoline prices, and retail prices for food. However, users might also be interested in obtaining meta-information, or seek help with manipulating a table or with the user interface. In some cases, the user will make out-of-domain requests. Those auxiliary intents are described in Appendix A.1, since the focus of this work is on the main intent.
Messages and turns Each conversation is broken down into turns, where a turn is a pair of user and agent responses. Each response can consist of multiple messages sent sequentially (e.g., in Table 1, the first agent response contains 3 back-to-back messages).

Basic statistics
In total, 25,397 conversations will be made available. Focusing on the main intent, we work with a subset of 4,468 conversations (3,675 in English and 793 in French).
Out of a total of 5,907 tables available in both English and French, the agents returned 959 unique tables in English and 285 in French. The number of messages per conversation varies between 2 and 78, with a median of 12 for the English split (see Figure 1 for the distribution). Table 3 reports the number of turns and messages per conversation, with over 30 tokens for each message. This indicates that speakers express multiple sequential thoughts before the addressee responds. Basic statistics for the French split are analyzed in Appendix A.5.
Frequently requested tables In total, 6 tables make up 13.4% of the tables retrieved, covering subjects like inflation and household spending. Supplementary details can be found in Appendix A.3.1.

Dialogue Analysis
We categorize 100 turns (306 messages) from 24 conversations in the English validation set according to the speech acts defined by Bunt et al. (2010, 2020), also known as ISO standard 24617-2. We follow their taxonomy but merge some fine-grained acts with their broader concepts (e.g., correction, agreement, and disagreement with inform). We present the speech act frequencies and examples in Table 4. See the Appendix for more information on how we merged the original acts and for supplementary examples (Table 15 and Table 22, respectively). We notice that answers appear twice as frequently as information-seeking questions, because an interlocutor may provide an answer to both clarification questions and requests. Additionally, inform acts appear 63% of the time, because agents need to expand upon their answers and users tend to clarify their initial requests by informing the other party. Although less frequent, auto-feedback and time management are still relevant because interlocutors cannot rely on visual feedback like nodding or typing. Naturally, time management often co-occurs with promises, because the agent tends to put the user on hold while promising to fulfill their request.

Table specifications
To explain the specifications, we examine a sample table titled Production and value of maple products (shown in Table 5). For the geography dimension, the members are the provinces producing maple syrup (Quebec, Ontario, etc.), and for the Maple products dimension, the members are the production types (maple sugar, syrup, taffy, and butter). A member item generalizes the concept of rows and columns, as they are interchangeable via pivoting. Sometimes, details about those members are provided as footnotes at the end of the page.
Basic Information This is the core metadata and consists of the title, dimensions, subject, survey and update frequency (member items are excluded). Supplementary details are in Appendix A.2.2.
Hierarchical relation The metadata can be viewed hierarchically. As shown in Figure 2, each subject encompasses different surveys, each survey can be used to generate one or more tables, and so on. A member item that can be nested under another member item is called a Level.

Dataset splits
We group the conversations into a train (70%), a validation (15%), and a test (15%) set. The test set was specifically selected to contain the most recent conversations by date (covering Sept 8, 2020 to Mar 8, 2021), whereas the training and validation sets were randomly selected from the remaining data (covering Mar 1, 2019 to Sept 8, 2020). This lets us test a model's capability to adapt to temporal shifts in the data (such as new data releases and novel events). This is useful to understand a model's capability to generalize beyond the training distribution, but it is also a better reflection of real-world applications of a model (which will be used on future data). The same splits are used for all tasks.
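The split construction above can be sketched as follows; the conversation representation ((id, date) pairs) and the helper's interface are our own assumptions for illustration, not the released preprocessing code:

```python
import random
from datetime import date

def temporal_split(conversations, test_start, train_frac=0.7 / 0.85):
    """Hold out the most recent conversations as the test set, then
    randomly divide the remainder into train/validation.

    `conversations` is a list of (conv_id, date) pairs. The default
    train fraction reproduces a 70/15/15 overall split: the remaining
    85% of the data is divided 70:15 between train and validation.
    """
    test = [c for c in conversations if c[1] >= test_start]
    rest = [c for c in conversations if c[1] < test_start]
    random.Random(0).shuffle(rest)  # fixed seed for reproducibility
    n_train = round(len(rest) * train_frac)
    return rest[:n_train], rest[n_train:], test
```

Because the cutoff is a date rather than a random draw, the test set probes generalization to conversations the model could not have seen during training.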

Tasks
Based on the conversational and tabular data, we propose two tasks: (i) a table retrieval task, which requires a model to use a partial conversation to predict the table an agent will return, and (ii) a response generation task, which requires a model to use a partial conversation to generate the most probable response by the agent. The conversations in the tasks are available in both English and French.

Retrieval task
For this task, we truncate every conversation right before a link to a relevant table is shared by the agent. As a result, the product ID (PID) corresponding to that link becomes the objective of the retrieval task, as shown in Table 6. When the agent shares multiple non-repeating PIDs within a conversation, each unique occurrence is treated as a separate sample.
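A minimal sketch of how such retrieval samples could be derived from a conversation, assuming messages are (speaker, text) pairs and that table links embed the PID in a statcan.gc.ca URL (both are assumptions on our part, not the dataset's actual schema):

```python
import re

# Matches an 8-10 digit product ID inside a statcan.gc.ca table URL.
PID_RE = re.compile(r"statcan\.gc\.ca/\S*?(\d{8,10})")

def retrieval_samples(messages):
    """Yield (context, pid) pairs: the conversation truncated right
    before each agent message that shares a not-yet-seen table link."""
    seen = set()
    for i, (speaker, text) in enumerate(messages):
        if speaker != "agent":
            continue
        for pid in PID_RE.findall(text):
            if pid not in seen:  # each unique PID -> one sample
                seen.add(pid)
                yield messages[:i], pid
```

Note that repeated shares of the same PID are skipped, mirroring the "each unique occurrence" rule above.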
Recall@k To evaluate retrieval models, we compute the recall at k (R@k) score for k ∈ {1, 10, 20}, which corresponds to the rate at which the correct table is among the top k tables retrieved by the model (usually ranked by a relevance score). We choose k = 1 for real-time automatic retrieval and k ∈ {10, 20} for scenarios where humans or automatic rerankers would use the retriever to query tables and select the best option.
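Recall@k itself is straightforward; a small sketch of the metric as defined above (function names are ours):

```python
def recall_at_k(ranked_pids, gold_pid, k):
    """1.0 if the gold table is among the top-k retrieved, else 0.0."""
    return float(gold_pid in ranked_pids[:k])

def mean_recall_at_k(all_rankings, golds, k):
    """Average single-sample recall over a set of retrieval samples."""
    scores = [recall_at_k(r, g, k) for r, g in zip(all_rankings, golds)]
    return sum(scores) / len(scores)
```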

Response generation task
In the first task, only the messages leading to a table retrieval are considered. For this task, each message sent by an agent is considered a target, and everything before it is the source. Thus, the goal of this task is to use the source text to generate a response that matches the target (see Table 7). Since dialogue responses are challenging to evaluate, we report a wide variety of metrics for this task. ROUGE-L (Lin, 2004; Lin and Och, 2004) is a common text evaluation metric that naturally takes sentence-level structure into account by identifying the longest overlapping word sequence between two sentences. METEOR (Banerjee and Lavie, 2005) is a word-level precision and recall scoring method that encompasses different ways to represent a word, including stems and synonyms.
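For illustration, a simplified sentence-level ROUGE-L (an LCS-based F1 over whitespace tokens), omitting the stemming, casing, and multi-reference handling of the official implementation:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f(candidate, reference):
    """Simplified ROUGE-L F1 on whitespace-split tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)  # harmonic mean of precision/recall
```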

BERTScore and MoverScore Various methods were developed to leverage contextual embeddings for evaluating generated text; BERTScore and MoverScore are two such metrics that we report.

Title accuracy In addition to the general metrics for text generation, we also explored this task-specific metric. We define it as the proportion of generated messages that contain the title of a table shared in the reference messages. Consequently, this metric only includes turns where a table is shared by an agent. To compute it, we (i) find the product ID in the reference message, (ii) look up the title, and (iii) check whether that title appears exactly in the generated and reference text.
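The three steps above can be sketched as follows; the PID pattern and the `pid_to_title` lookup are assumed interfaces on our part, not the paper's actual code:

```python
import re

# Assumed pattern: a bare 8-10 digit product ID in the reference text.
PID_PATTERN = re.compile(r"\b(\d{8,10})\b")

def title_accuracy(references, generations, pid_to_title):
    """Fraction of table-sharing reference turns whose table title
    appears verbatim in the generated message."""
    hits, total = 0, 0
    for ref, gen in zip(references, generations):
        pids = [p for p in PID_PATTERN.findall(ref) if p in pid_to_title]
        if not pids:
            continue  # only count turns where the agent shared a table
        total += 1
        if all(pid_to_title[p] in gen for p in pids):
            hits += 1
    return hits / total if total else 0.0
```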

Models
To help understand the performance of fine-tuned models on our tasks, this section presents an overview of the methods for the English splits; the implementation details are covered in Appendix A.4. Similar architectures were used for French (described in Appendix A.5). For response generation, we fine-tune T5 and explore variants where the input is augmented with table titles retrieved by DPR. This allows T5 to decide between using one of the suggested titles and generating something else (e.g., a clarification question). This is similar to the agents' behavior, as they tend to return a title with the URL when sharing a relevant table. Furthermore, supervising T5 to ignore or return a title is equivalent to an implicit binary classification.
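The augmentation could look like the following sketch; the speaker prefixes, separator token, and formatting are illustrative assumptions, not the paper's actual input format:

```python
def build_t5_source(history, retrieved_titles, max_titles=5):
    """Assemble a (hypothetical) seq2seq source string: the dialogue
    history followed by the top-k titles suggested by the retriever.

    `history` is a list of (speaker, text) pairs; `retrieved_titles`
    is a relevance-ranked list of table titles from DPR."""
    turns = " ".join(f"{spk}: {txt}" for spk, txt in history)
    titles = " | ".join(retrieved_titles[:max_titles])
    return f"{turns} [TITLES] {titles}"
```

During fine-tuning, the model sees targets that either copy one of the appended titles (when the agent shared a table) or ignore them (when the agent asked a clarification question), which is the implicit binary classification described above.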

Results and Discussions
Based on our baselines and data, we report the results and analyze the challenges that our dataset and tasks pose for existing models. For the English splits, the main retrieval results are reported in Table 9 and Table 8, and the main generation results are in Table 10. Full results can be found in Appendix A (Table 16 and Table 33, respectively), and relevant statistical tests in Appendix A.7.

Impact of table representation
In Table 9, we observe that the metadata representation affects retrieval recall. Although DPR achieves respectable results when tables are represented by their title alone, including basic information (defined in Section 3.3) yields slight improvements, and further adding member items results in a significant difference from using only the title (p = 0.014). However, using only member items results in a drastic decrease in recall (p = 0.00086), indicating the importance of the title. Moreover, footnotes do not yield any improvement, which may be because they often exceed the maximum context span (see Table 20). Thus, representing metadata concisely but meaningfully will be crucial to achieving good recall on the retrieval task.

Transfer to table retrieval task Our experiments allow us to analyze the effectiveness of open-domain QA fine-tuning (NQ) and tabular pre-training when transferring to our table retrieval task. We observe in Table 9 that DPR outperforms TAPAS and TAPAS-NQ by 23.0% and 15.1%, respectively, in test recall@10. Moreover, TAPAS-NQ achieves better performance when it represents tables with only the title and member items instead of the full table (p = 0.016), likely due to repetitions and truncation caused by context size limits. Although both DPR and TAPAS-NQ were trained on NQ, the latter was trained on a small subset (12K vs. 320K) that contains tables. Our experiments indicate that TAPAS transfers poorly from one task (NQ-Tables) to another (StatCan).

Response generation
We compare the performance of fine-tuned T5-large models with and without DPR-augmented table titles. In Table 10, we notice that retrieval-augmented models show modest improvements on the 4 non-task-specific metrics. However, the top-5 augmented model achieves an absolute improvement of 3.86% in title accuracy, indicating that the information provided by DPR does help T5 generate the desired title. It is nevertheless surprising that T5 without augmentation achieves a score of 6.96%, suggesting that the T5 model is capable of storing the titles seen at training time and, to a limited extent, is able to recall and return them at test time. Finally, we point out that the title accuracy is still 5.91% lower than the top-1 recall of the DPR retriever. The model can reliably reply with the title of a common table, struggles for uncommon ones, and is sometimes capable of generating unseen titles with the help of DPR. The full analysis is in Appendix A.6.

Figure 3: Results for DPR B+M for tables appearing in the training set frequently (10+ times), infrequently (1-10 times), and unseen at train time.
Temporal drifts As explained in Section 3.4, we use a temporal split to test the model's ability to generalize to future conversations. We observe a significant drop in recall (13%-28.3%) in Figure 4 when we compare the validation and test set performance, even when the models are trained with varying metadata representations. Similarly, T5 achieves low scores on the test split for the response generation task (Table 10). This large gap suggests that trained models struggle to generalize to future conversations. First, we found that this is likely caused by the number of new tables that appear in the test split (145) compared to the validation split (41), as shown in Table 3. Moreover, the subjects of the conversations have changed significantly: users started to care more about businesses, health, and IT, and less about demography, income, and pensions. This is likely motivated by real-world events affecting the users, which are difficult to implicitly capture in simulated environments but desirable for understanding a model's robustness to temporal shifts and its behavior in real-world applications. In the Appendix, Figure 8 displays the differences between the training and test splits for all subjects.
Generalizing to unseen tables As shown in Figure 3, DPR performs well for tables appearing frequently in the validation split, but poorly in the test split, which could be caused by temporal drift. As expected, tables that were not seen during training result in poor recall@5 in either split. Moreover, the difference in recall between validation and test for infrequent tables could be caused by many potential factors (learning bias, temporal overfitting, spurious correlations with hidden factors). Thus, future models should aim to close the gap between unseen and frequent tables and within the temporal spectrum of infrequent tables.

Figure 4: Validation and test recall for a selected set of retrieval models. We observe a significant drop in performance. B+M denotes Basic + member.
French results In both tasks, we see a drop across all metrics for all models, likely due to the smaller dataset size. Some observations remain valid: temporal drift, poor BM25 performance, and augmentations benefiting mT5 on certain metrics. However, others differ: adding member items hurts test results, and mT5 performs poorly on title accuracy. Modeling details and results can be found in Appendix A.5.

Conclusion
In this paper, we introduce the StatCan Dialogue Dataset, a novel corpus consisting of 20K+ English and French conversations between online visitors of statcan.gc.ca and operators of Statistics Canada. Based on this dataset, we propose two tasks centered on helping users find specific data tables: the table retrieval task and the response generation task. For the table retrieval task, we experiment with various DPR and TAPAS variants, finding that DPR strongly outperforms its TAPAS counterpart, as well as the BM25 baseline. For the response generation task, we investigate fine-tuned T5-large models and explore variants where the input is augmented with table titles from DPR. We find that retrieval-augmented T5 models more frequently return the correct tables, although their title accuracy is still lower than the corresponding recall of the DPR retriever. This result suggests that the generation models struggle to decide when to return a table. We also find that retrieval and generation models have difficulty generalizing to future conversations, as our temporal test split revealed a large performance gap between the validation and test sets. All in all, we believe that our tasks pose significant challenges to currently available models, and we encourage the research community to further explore this dataset and build conversational models that help users of Statistics Canada.

Limitations
Tasks and models limitations The tables in the retrieval task are sourced from statcan.gc.ca, which means that the content is primarily about Canadian demographics and is professionally edited by StatCan employees. Moreover, the generation task is specifically designed to model responses with high fidelity based on retrieved tables, so this task should not be directly used in an unintended or non-research setting (e.g., deploying a virtual assistant), as the models pose risks of hallucination that could negatively impact stakeholders. Furthermore, those limitations can be reflected in the models we trained, so we will share those limitations in the model cards (Mitchell et al., 2019) on release.
Environment impact We acknowledge that the models in Section 5 used hardware with significant energy consumption. We purposefully chose models of reasonable sizes that can be reproduced on a single GPU. Additionally, our hardware is powered by renewable energy.

Artifacts and computational experiments
We trained models using libraries based on their intended use and we will release the relevant artifacts following the original licenses. The computational details of the experiments are described in Appendix A.4.

Ethics Statement
Privacy and data access As discussed in Section 1, significant efforts were made to remove Personally Identifiable Information (PII). However, we do not rule out the possibility that certain details could have been missed in that process. Thus, any user that wishes to use the data will need to authenticate and accept the terms of use through an institutional data repository; the terms will require the user to report any instance of PII leakage, which will be removed with a dataset update. Additionally, we request that any derivatives or modifications follow the same terms.

Help with user interface Some users are looking for help with the user interface. For example, one user wanted to download a specific data table but was unable to find the download link.
Out of domain requests We find many conversations that are outside the scope of StatCan's live chat. For example, one user asked what documentation needs to be provided to ship a specific product to a foreign country.
The first intent is covered in Table 1, and subsequent intents are in Table 18. While we believe all intents are interesting directions for dialogue research, we focus on the table retrieval intent because (i) there are many conversations available for it and (ii) there is a clear measure of task success, i.e., whether the correct table is retrieved. Throughout the rest of this paper, we work with conversations where the agent returns a table URL.

A.2.1 Formatting and size
Full tables are stored in a long format, where each column corresponds to a dimension, except the last one, which corresponds to the value. The number of rows corresponds to the Cartesian product of the dimensions' members. This means that for n_i members in each of d dimensions, the full table has a total of ∏_{i=1}^{d} n_i rows. When a table has many member items, the full table can become extremely large. Moreover, tables shown on statcan.gc.ca are usually a pivoted and filtered view of the full table, which means certain members will become columns, others will become rows, and many are simply omitted.
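The long format can be reproduced with a Cartesian product over member lists; this is a sketch under our own assumed representation (a dict of dimension name to member list, and a dict mapping member tuples to recorded values):

```python
from itertools import product

def long_format(dimensions, values):
    """Expand per-dimension member lists into long-format rows, one per
    tuple in the Cartesian product of the dimensions. Tuples with no
    recorded value yield None, mirroring unrecorded combinations."""
    header = list(dimensions) + ["VALUE"]
    rows = [combo + (values.get(combo),)
            for combo in product(*dimensions.values())]
    return header, rows
```

With 2 members in each of 2 dimensions this yields 2 × 2 = 4 rows, which is why tables with many member items become extremely large.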

A.2.2 Detailed Specifications
This section provides supplementary details for Section 3.3.
Product ID (PID) A unique 8 to 10-digit identifier given to each published data table. Although other types of tables might be shared by the agent, the PID will always be given for the official data tables; as a result, any table that does not have a PID in the URL is not considered for this task. The first two digits (1-2) represent the code of the subject associated with the table (this can be found in the basic information), and digits 3-4 represent the product type, which in our case is tables, commonly coded as "10". Digits 5-8 are a unique identifier representing that table within the given subject. Digits 9-10 indicate the view of the table and will commonly be "01", which represents the default view; they are needed when constructing the URL but are otherwise optional, and are omitted in the released dataset.
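A small parser following this digit layout; the field names are our own labels, not official StatCan terminology:

```python
def parse_pid(pid):
    """Decompose an 8-10 digit product ID into its positional fields."""
    if not (8 <= len(pid) <= 10 and pid.isdigit()):
        raise ValueError("PID must be 8-10 digits")
    return {
        "subject": pid[0:2],        # subject code
        "product_type": pid[2:4],   # "10" for data tables
        "table_id": pid[4:8],       # table identifier within the subject
        "view": pid[8:10] or "01",  # default view when digits 9-10 omitted
    }
```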
Member items Labels for individual tables, and can either be represented as a column or a row index through a pivot operation. Each table will have member items different from other tables.
Dimensions Non-overlapping sets of member items; each member item must belong to a dimension. For a table with d dimensions, each data value is associated with a single d-tuple in the d-fold Cartesian product of the dimensions, but not every tuple in the product will have an associated data value (if it was not recorded or if the tuple is invalid).
Basic information Metadata consisting of the title, the date range, the frequency, the dimensions, the subject category, and the survey from which the data was sourced.
Footnotes Unstructured comments often included with a table if supplementary details need to be given; usually, those notes are associated with a specific member item that requires more explanations.
Full table For all tables with a PID, the full tables (as a CSV file), their complete metadata, and their basic information are made available as part of the data release and can be used in our proposed tasks. A sample table can be found in Section 3.3.

A.2.3 Table updates and archives
Tables on StatCan are sometimes updated regularly, whereas at other times they may only be released once. When they stop being updated and the information becomes outdated, they are marked as archived. In some cases, a new version of an archived table may be created with substantial changes (such as new columns). As shown in Table 14, tables are released without a predetermined schedule, but the number released per year has on average been increasing since 2000, with major peaks in 2017, caused by many health-related tables, and in 2021, caused by the release of many labour, science, and income tables, as shown in Figure 6. Moreover, Figure 5 shows that a majority of tables released between 2019 and 2021 are still up-to-date (current), whereas most of the tables from before then have been archived.

A.3.1 Frequently requested tables
The most frequent tables are summarized in Table 21. Whereas 2 of them are sourced from the consumer price index survey (commonly used to track inflation), the 4 other tables cover more general and broad subjects like income, demography, business performance, and crime. Each of those 4 tables is sourced from a different survey. One table is updated monthly, another semi-annually, and the rest are updated annually. The oldest table was last updated in 2019, which is when the conversations started being recorded.

A.3.2 Table Frequency Statistics
From Tables 3 and 12, we can calculate that a table is returned on average 5.68 times (standard deviation 12.86) in English conversations and 4.25 times (standard deviation 9.73) in French conversations. Thus, the most requested tables are disproportionately represented compared to less popular tables, and there is very high variance in the number of times a table is used. In fact, 294 tables appear only once in either split (i.e., 28.9%).

A.3.3 Fine-grained Conversation Statistics
In Table 13, we observe that the number of messages and turns varies significantly around the mean, with over 68% of conversations lasting between 2 and 7 turns. In extreme cases, a conversation can last up to 28 turns. Moreover, we also notice that, although most messages have around 32 tokens, the longest message can have up to 1,374 tokens; in those scenarios, agents write a large body of text, and sometimes copy and paste large amounts of text (for example, from a database of templates) when responding to the user. Although these are usually sent in multiple consecutive messages within a turn, agents may decide to send everything at once.

A.4 Implementation Details
This section provides the details for implementing the models in Section 5.

DPR and TAPAS-NQ We used the base variant of DPR and the large variant of TAPAS-NQ. We use the DPR checkpoints that were trained on 320K questions from Natural Questions (Kwiatkowski et al., 2019) (NQ). During training, the networks were optimized with AdamW (Loshchilov and Hutter, 2017) at a learning rate of 10^-5 and zero weight decay. Following the original work, the networks were trained for 30 epochs with a batch size of 64 queries, positive passages, and hard negative passages (the latter retrieved with BM25). In-batch negative sampling was used to increase the number of negative examples. To ensure reproducibility, the networks were trained on a single 32GB GPU and used gradient checkpointing (Chen et al., 2016) to reduce memory usage. The conversation length was capped at 512 tokens, and the metadata token lengths were 128 for the title, 256 for basic information (defined in Section 3.3), and 512 for the rest.

Implementing transformer models
T5 We used an Adafactor optimizer (Shazeer and Stern, 2018) with a learning rate of 0.001. We used batch sizes of 16 with 8 steps of gradient accumulation and gradient checkpointing to reproduce the batch size of 128 samples in the original implementation. The models were trained on a single 32GB GPU for 10 epochs. The source and target lengths were 512 and 256 tokens, respectively, where the source was truncated from the beginning so that the latest messages remained after truncation. We used a beam size of 4 and a length penalty of 0.6, following the original implementation.
Training time For the English split, each variant of DPR can be trained in 68 minutes on an A100 GPU. Each large variant of TAPAS-NQ takes 15h to train on a V100 GPU. Each variant of T5 can be trained in 16h on a V100 GPU. All DPR results can be reproduced in 8h, TAPAS in 90h, and T5 in 96h; proportionally less time is needed for the French split.

A.5 Modeling the French subsets
Basic statistics The number of messages per conversation varies between 2 and 59, with a median of 11 for the English split (see Figure 7 for the distribution). Based on Table 12, there are on average 3.9 turns but 12.3 messages per conversation. On average, each message has over 30 tokens (using the T5 tokenizer).
Language splitting To determine the language of each conversation, we used two popular language identification libraries: langid.py (Lui and Baldwin, 2011) and a fasttext network finetuned for language detection (Joulin et al., 2017). After applying both models to every conversation, we only retain the conversations with matching language labels (both English or both French).
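The agreement filter can be sketched as follows; `detect_a` and `detect_b` stand in for the two libraries' top-1 predictions (the actual code would call langid.py and the fasttext model):

```python
def filter_by_language_agreement(conversations, detect_a, detect_b):
    """Keep a conversation only when both detectors agree on 'en'
    or 'fr'; disagreements and other languages are discarded."""
    kept = {"en": [], "fr": []}
    for conv in conversations:
        lang_a, lang_b = detect_a(conv), detect_b(conv)
        if lang_a == lang_b and lang_a in kept:
            kept[lang_a].append(conv)
    return kept
```

Requiring agreement between two independent detectors trades a little coverage for much cleaner language labels on short, noisy chat text.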

Training and evaluation
The training procedure and evaluation on the French subsets follow exactly the tasks specified in Section 4.
Modeling response generation Instead of T5, we used the multilingual T5 model by Xue et al. (2021) as it naturally handles text in French.
Modeling retrieval We used a variant of DPR derived from CamemBERT (Martin et al., 2020) and trained on three French Q&A datasets (Keraron et al., 2020; d'Hoffschmidt et al., 2020; Kabbadj, 2021) by Etalab Lab-IA (2021). In Table 17, we observe that, unlike in the English split, adding member items to the basic information or to the title improves validation results but not test results, which likely indicates overfitting. However, we notice a high variance between the runs, which makes it difficult to determine whether member items are helpful. Both the overfitting and the high variance are likely caused by the smaller size of the training set. Moreover, BM25 performs extremely poorly on every metadata view, which can also be linked to the dataset size. In Table 33, we notice a significant decrease across all metrics, with the title accuracy being consistently 0%. This is likely because the French split is significantly smaller, yet remains as complex as the English split, which makes it challenging for mT5 to model. In the case of title accuracy, we found 55 instances in the French test split where the title is in the target text (i.e., returned by an agent). However, in 54 cases, the augmented mT5 returned a generic reply (e.g., "Veuillez patienter pendant que j'effectue une recherche.", "Please wait while I perform a search.") instead of the expected title, which indicates that mT5 is incapable of determining when it is relevant to return a title and cannot generate non-templated responses.

A.6 Responses generated by T5
In this section, we select a few conversations from the validation set and examine the messages generated by T5 and T5 augmented with DPR-retrieved titles (T5+D).
Common and uncommon responses In Table 23, we notice that both T5 and T5+D are capable of generating common speech acts like "Thank you" and "Please wait...", but struggle when faced with an unfamiliar situation (having to ask for clarification from a user who has been accidentally disconnected). In Table 24, among the retrieved table titles (R_i), the first one was partially correct. Both T5 and T5+D extended the title and also output the desired ID, matching the expected agent's response; the table is one of the most popular tables in the training set (see Table 21).

Common table
Multiple tables, date selection In Table 28, we notice that T5 only returns one of the two tables that the agent returned. DPR, on the other hand, retrieved the correct tables, but T5+D failed to select the ones with the correct dates (it selected June 2019 instead of December 2019); the tables it selected were otherwise relevant.
Verbosity of explanations T5+D additionally provided a paragraph of explanation while linking to relevant resources, both on the StatCan website (non-tabular) and external resources. This is because T5+D memorized this information during training, and simply replaced the tables with the updated dates (Table 27). In the conversations shown in Tables 25, 26, and 29, the retrieved table appears 10 times in the training set, which is significantly less common than the table retrieved in Table 24. For Table 25, the table returned by the agent was not retrieved by DPR, leading to T5+D returning the first retrieved title. However, in Table 26, the correct title was retrieved by DPR (title #4), yet T5+D failed to use that correct title in the generated message. As for Table 29, the agent gave a hint by stating "As a standard product, we have tables about employment by industry", which was correctly picked up by DPR, as the second retrieved title perfectly matches the PID of the table in the agent's response. However, T5+D again fails to select the correct title, instead opting to return a generic response ("Please hold while I find the information"), and T5 hallucinates a PID that is different from the title it generated (both of which are wrong).

Uncommon tables
Tables unseen during training Among the tables that do not appear in the training set (see Table 11 for more information), seven appear 3 or more times in the validation or test sets (Table 32). In Table 30, we see a conversation where DPR retrieves the correct title, which is correctly returned by T5+D, whereas T5 fails to return it. On the other hand, when DPR also correctly retrieves the title in Table 31, T5+D fails to return it, as it was likely misled by the agent saying "Unfortunately,...".

A.7 Statistical Tests
To back the claims in Section 6, we performed multiple one-tailed Welch t-tests, using the means and corrected standard deviations from Table 16. The null hypothesis is that the mean of experiment A is not greater than the mean of experiment B, computed across 3 runs. Unless otherwise specified, we use R@1 on the test split. In Figure 4, the validation recall@10 is higher than on the test split for TAPAS, DPR Title, and DPR Basic + member. With A being the validation scores and B the test scores, the p-values are respectively 0.00197, 2.18 × 10^-5, and 0.00014.
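For reference, the Welch statistic and the Welch-Satterthwaite degrees of freedom can be computed directly from the reported means and corrected standard deviations; a minimal sketch (the p-value is then the survival function of the t distribution at t with df degrees of freedom, e.g. `scipy.stats.t.sf(t, df)`):

```python
import math

def welch_one_tailed(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom
    for the one-tailed alternative H1: mean_a > mean_b."""
    var_a, var_b = std_a**2 / n_a, std_b**2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1)
    )
    return t, df
```

With only 3 runs per experiment, the degrees of freedom are small, so the significance reported above holds despite the tiny sample sizes.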

B Dataset Card
This section presents a dataset card that follows the format proposed by Lhoest et al. (2021), which was inspired by Mitchell et al. (2019) and Gebru et al. (2021).
Summary The StatCan Dialogue Dataset consists of over 20K conversations between agents working at Statistics Canada (StatCan) and users who are visiting StatCan's website and need support via the official live chat system.
Tasks A subset of 19K conversation turns is used to build two tasks: 1. Automatic retrieval of relevant tables based on an ongoing conversation. For each partial conversation, the task is to return the ID of the table most likely to be returned by an agent. This is evaluated using the recall@k metric.
2. Automatic generation of appropriate agent responses at each turn. For each partial conversation, the task is to return the most likely agent response, including a link to a relevant table. This is evaluated using four metrics described in Section 4.2.
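The recall@k evaluation for the first task can be sketched as follows (function names are illustrative, not from the released evaluation code):

```python
def recall_at_k(ranked_pids, target_pid, k):
    """1.0 if the gold table's product ID appears among the top-k
    retrieved IDs for one query, else 0.0."""
    return float(target_pid in ranked_pids[:k])

def mean_recall_at_k(rankings, targets, k):
    """Average recall@k over all queries in a split."""
    scores = [recall_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(scores) / len(scores)
```

Since each query has exactly one gold table, recall@k here coincides with top-k accuracy.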
Leaderboard The leaderboard and submission instructions can be found on the project webpage. Each submission is accompanied by a tag indicating whether:
• It was self-reported;
• It was externally evaluated;
• The inference was reproduced following provided instructions;
• The complete training process was independently reproduced.
Languages The conversations were held in Canadian English (en-CA) and Canadian French (fr-CA).

B.1 Dataset Structure
B.1.1 Data Instances
Conversation A full example of a conversation can be found in Table 19. Instances for each user intent can be found in Table 18, and two conversations with annotated dialogue acts can be found in Table 22. For our case study in Section A.6, we show partial conversations in Tables 24, 25, 26, 27, 28, 29, 30, and 31.
Tables The complete metadata of a table can be found in Table 20, which can be accessed at doi.org/10.25318/3210035401-eng. Table 21 shows the basic information for the most popular tables.

B.1.2 Data Fields
Full dataset A CSV file with the following fields is provided:
• conversation: The partial conversation (before a table is returned) in JSON format.
• conversation_index: A unique index that serves to identify the conversation outside of this task.
• conversation_processed: The conversation converted into a readable text format, with extra information (such as timestamp) removed, the URLs replaced with a special tag, and separation tags (</s>) added.
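The processing behind conversation_processed can be sketched along these lines (the helper name, speaker keys, and exact tags are illustrative; the released files may differ in detail):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def process_conversation(messages, url_tag="<url>", sep="</s>"):
    """Replace URLs with a special tag and join speaker-prefixed
    messages with separation tags (timestamps already dropped)."""
    cleaned = [URL_RE.sub(url_tag, f"{m['speaker']}: {m['text']}")
               for m in messages]
    return f" {sep} ".join(cleaned)
```

Masking URLs keeps the model from memorizing stale links, while the separation tags preserve message boundaries in the flattened text.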
Retrieval task CSV files with the following fields are provided for each split:
• conversation
• conversation_index
• conversation_processed
• target_pid: The product ID of the table that is returned by the agent
• language: The language reported by the live chat system, which may not always be accurate due to mislabeling.

Metadata
The metadata used during retrieval is provided as a CSV file with the following fields (one row for each of the 5907 tables):
• pid: The product ID of the table
• full_info: The basic information, member items and footnotes, all in a single entry
• x_and_y: A combination of two of the items above, for example x=title and y=footnotes.
• *_fr: All of the above are also available in French, indicated by the suffix _fr.
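The metadata views can be assembled along these lines (a sketch; the dict keys and the exact concatenation used in the released files are assumptions for illustration):

```python
def build_metadata_views(table):
    """Combine metadata fields of one table into retrieval 'views'.
    `table` is a dict with title/basic_info/member_items/footnotes
    strings; key names follow the CSV description above."""
    return {
        "title": table["title"],
        "basic_info": table["basic_info"],
        "full_info": " ".join(
            [table["basic_info"], table["member_items"], table["footnotes"]]
        ),
        # one example of an x_and_y combination: x=title, y=footnotes
        "title_and_footnotes": f"{table['title']} {table['footnotes']}",
    }
```

Each view is then encoded as the passage side of the retriever, so comparing views amounts to asking which metadata fields help DPR most.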
Generation task CSV files with the following fields are provided for each split:
• source: Equivalent to conversation.
• target: The message written by the agent following the conversation.
• target_processed: The message written by the agent following the conversation, with URLs replaced with a special tag.
• conversation_index
An augmented variant of each CSV file for the generation task is provided with all of the above, as well as the following fields:
• source_augmented: The same content as conversation, appended with the titles of the top-5 tables retrieved by the best DPR variant trained on the basic information.
• target_augmented_1: The same content as source_augmented, but using only the first table instead of the top-5.
Conversation JSON formatting A conversation follows the following JSON format:
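Purely as an illustration (the field names below are hypothetical, not the released schema), a conversation stored as a JSON list of message objects could be parsed like this:

```python
import json

# Hypothetical example: "speaker" and "text" are assumed field names.
example = json.loads("""
[
  {"speaker": "user",  "text": "Hello, I am looking for CPI data."},
  {"speaker": "agent", "text": "Please wait while I search."}
]
""")
speakers = [m["speaker"] for m in example]
```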

B.1.3 Data Split
The retrieval splits have the following numbers of samples:
• Train: 3782 (en); 869 (fr)
• Validation: 799 (en); 201 (fr)
• Test: 870 (en); 141 (fr)
These correspond to the number of queries in Table 3 and Table 12, because each query results in a table being retrieved, which can happen more than once in a conversation. In such cases, the partial conversations are truncated at different turns of the conversation.
The generation splits have the following numbers of samples:
• Train: 21582 (en); 3977 (fr)
• Validation: 4464 (en); 861 (fr)
• Test: 4850 (en); 884 (fr)
There are fewer samples than the number of messages in Table 3 and Table 12 because the goal of the task is only to predict the messages written by the agent. Just like in the retrieval task, the partial conversations are truncated at different turns of the conversation.
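Building one generation sample per agent message can be sketched as follows (illustrative; `conversation` is a list of (speaker, text) tuples with 'A' marking the agent, and the separator mirrors the processed-text format):

```python
def generation_samples(conversation):
    """Return (source, target) pairs: one sample per agent message,
    where the source is everything said before that message."""
    samples = []
    for i, (speaker, text) in enumerate(conversation):
        if speaker == "A" and i > 0:
            source = " </s> ".join(t for _, t in conversation[:i])
            samples.append((source, text))
    return samples
```

This is why a single conversation contributes several samples, each truncated at a different turn.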

B.2.1 Curation Rationale
Section 1 extensively motivates the curation of the dataset. To summarize, we enumerate the major points:
• Data from real users: We wanted a dataset that captures the linguistic challenges that exist in the real world.
• Task-oriented dialogue: We wanted tasks with the specific goal of helping live chat users in their search for statistics.
• Real-world applications: Our model can be directly applied to other statistics offices that want to set up a chat system, and our dataset will be useful for any organization that has (1) a chat system, and (2) a database of tables.
• Multilingual dialogue: We wanted to build models that can handle languages beyond English, which is why we also offer French versions of our tasks.

B.2.2 Source Data
Conversations The data was retrieved from the live chat system on statcan.gc.ca, which was anonymized by the development team at StatCan. The conversations happened between March 2019 and March 2021.
Tables The tables are publicly available and were downloaded following the instructions in the Web Data Service user guide: statcan.gc.ca/en/developers/wds/user-guide. The tables were released by Statistics Canada between 2000 and 2021. The data are either collected directly by Statistics Canada (e.g., through a census or a survey) or compiled from existing sources (such as private sector organizations and government agencies) into official statistics. Existing sources include: • Administrative data: Collected by government or the private sector as part of ongoing operations, including records of birth and death, taxes, border control, and satellite data.
• Microdata Linkage: Existing information is linked to create new data. The existing information may not always be available publicly (for privacy purposes), thus linkage could add new information that was previously unavailable, while protecting the confidentiality of the public.
• Open data: Machine-readable and freely available data sourced from various channels (e.g., OpenStreetMap).
• Web scraping: Data from the internet that were scraped by Statistics Canada (this excludes personal information and "any information that will not be used to produce statistical output").

B.2.3 Annotations
The dataset does not contain any additional annotations beyond the ones collected through the live chat system and included in the table metadata.

B.2.4 Personal and Sensitive Information
Personal and sensitive information was removed programmatically from the conversations, and officially published tables only contain aggregated information that preserves the confidentiality of the participants. Although the removal process is thorough, there is a non-zero chance that some information could be used to reconstruct the profile of a user. For this reason, access to the data requires researchers to sign up and agree to the terms of use, and any derivative must be shared on the same platform and include the same terms.

B.3.1 Social Impact of Dataset
The purpose of this dataset is to develop and evaluate models that can assist knowledge workers in finding relevant tables from a data source. By providing a specialized retrieval system capable of returning more relevant results than a general-purpose one, the productivity of knowledge workers can be increased. For public agencies and statistics offices, this would benefit the many live chat users interested in statistics related to a certain community.

B.3.2 Discussion of Biases
As discussed in Section 7, there is always a risk of toxicity in online discourse, which means that live users may exhibit negative biases in their messages. However, the StatCan agents are tasked with communicating with online users in a professional manner. Thus, researchers should not use this dataset to build models that generate messages written by live users, and any model trained on the dataset should not be used in scenarios where biases can negatively impact stakeholders.

B.3.3 Other Known Limitations
Section 7 describes other known limitations.

B.4.1 Dataset Curators
The dataset was curated by the authors of this paper based on the original data collected and processed by StatCan developers and agents.

B.4.2 Licensing Information
The conversations use a custom license, which needs to be accepted by researchers interested in accessing the conversations. The tables are released under the Statistics Canada Open Licence: statcan.gc.ca/en/reference/licence.

[Residue of a flattened table: per-year counts of released data tables, 2001-2021; the original column layout is not recoverable.]

it is also possible to download the monthly data in .CSV form, add together the months in a spreadsheet to get annual totals
A: or if you'd prefer, you can make a custom request to our analysts (a fee may apply). Please send us your request, clearly indicating the variables, geographic areas, time frame and data frequency (i.e., annually, monthly or quarterly) needed at infostats@canada.ca (mailto:infostats@canada.ca) and refer to Case number: 986005 so we can access the transcript of this live chat. Your request must include your name, address, postal code, and phone number (please do not share your personal information on the live chat.) An agent will get in touch with you to provide an estimate and proceed with the payment, if applicable. You may also submit your request by telephone at 1-800-263-1136.
U: thats fine! thank you very much :)
U: very helpful
U: :)!
U: have a good day!
A: glad I could help! you too

Help with user interface
U: Hello, Do you know where I can download the dataset for General Social Survey Cycle 32?
A: Please wait while I try to find this information for you.
U: Thank you
U: (I'm referring to this one: https://www23.statcan.gc.ca/imdb/p2SV.pl?... )
A: General Social Survey: Caregiving and Care Receiving, Public Use Microdata File (https://www150.statcan.gc.ca/...)
U: I'm sorry, but im having trouble finding the download link for cycle 32 (2018)
A: you have to click on 'More information' and that will bring you to the order form
U: ohhhh, I see
U: thank you for letting me know
U: thank you for your help!
A: you're welcome - happy to help!

Out of domain requests
U: I'm wondering how to obtain a copy of an autopsy that's been done for a family member. Doctor office says they don't get the autopsy report.
A: Statistics Canada does not issue certificates for birth, death or marriage. Civil registration is a provincial jurisdiction. Therefore, you must contact the government ministry, registry or vital statistics office of the province or territory in which the birth, death or marriage occurred. To obtain the contact information for your provincial or territorial government, please call Service Canada at 1-800-O-Canada (1-800-622-6232) or visit the following page: Vital Statistics contacts (https://www.statcan.gc.ca/eng/health...) .
A: also: Coroner's report (https://spvm.qc.ca/en/...)

U: I was talking with Sylvain, will you be able to see the log?
A: Yes, please hold while I verify.
T5: Yes, please hold while I verify.
T5+D: Yes, please hold while I verify.
A: Just to clarify, you're looking for all-items ?
T5: You can obtain this information at the following hyperlink : Table: 18-10-0004-13 Consumer Price Index by product group, monthly, percentage change, not seasonally adjusted, Canada, provinces, Whitehorse, Yellowknife and Iqaluit ( )
T5+D: You can obtain this information at the following hyperlink : Table: 18-10-0004-01 Consumer Price Index by product group, monthly, [...] and Iqaluit ( )

Table 23: Conversation #4890. Each message is separated by a horizontal line. The original conversation, in bold, only contains messages by the user (U) and agent (A). The generated responses by T5 and T5 with top-5 DPR-retrieved titles (T5+D) predict the corresponding agent's message (in bold).
[...]
U: Ideally, I want the percentage change for these items
R1: Consumer Price Index, monthly, not seasonally adjusted
R2: Monthly adjusted price index, provisional
R3: Consumer Price Index, monthly, seasonally adjusted
R4: Historical (real-time) releases of the industrial product price index, by major product group, monthly, inactive
R5: Consumer Price Index (CPI) statistics, alternative measures, unadjusted and seasonally adjusted, Bank of Canada

A: You can obtain this information at the following hyperlink : Table: 18-10-0004-13 Consumer Price Index by product group, monthly, percentage change, not seasonally adjusted, Canada, provinces, Whitehorse, Yellowknife and Iqaluit (https://www150.statcan.gc.ca/t1/tbl1/...)
T5: You can obtain this information at the following hyperlink : Table: 18-10-0004-13 Consumer Price Index [...] and Iqaluit ( )
T5+D: You can obtain this information at the following hyperlink : Table: 18-10-0004-13 Consumer Price Index [...] and Iqaluit ( )

U: How can I find information relating to Kelowna population estimates, demography, population projection estimates, population by age and sex
A: hello how may i help you today?
U: How can I find information relating to Kelowna population estimates, demography, population projection estimates, population by age and sex
A: let me see what i can find on the subject for the projections information our tables only provide it to the provincial level it does not go lowe then that <extra_id_0> ( <extra_id_0> )
R1: Estimates and projections of population aged 0 to 29, by age group, Canada, provinces and territories
R2: Projected population, by projection scenario, age and sex, as of July 1
R3: Population estimates on July 1st, by age and sex
R4: Components of projected population growth, by projection scenario
R5: Population estimates, quarterly
A: https://www150.statcan.gc.ca/t1/tbl1/en/cv.action?pid=1710005801
T5: https://www150.statcan.gc.ca/n1/daily-quotidien/190619/dq190619f-cansim-eng.htm
T5+D: https://www150.statcan.gc.ca/n1/pub/71-607-x/71-607-x2018005-eng.htm

Thank you for waiting. You will find the information you're looking for if you click the following hyperlink : Canadian Business Counts, with employees, June 2019 ( )
T5+D: Thank you for waiting. The industries are classified according to the North American Industry Classification System (NAICS).
To find the NAICS code for an industry, please refer to this page (http://www23.statcan.gc.ca/imdb/...1181553). You may start your search by selecting one of the 20 sectors, until a six-digit NAICS is obtained. You will find the number of business locations by NAICS code in the following data tables: Table 33-10-0214-01 Canadian Business Counts, with employees, June 2019 ( ) Table 33-10-0215-01 Canadian Business Counts, without employees, June 2019 ( ) I also suggest consulting the financial data of small and medium enterprise in Canada; these organizations have annual earning revenues ranges of $30,000 to $20 million. This information is available by industry, province, territory and legal status of companies (incorporated or unincorporated). This data is not available on our website; however, it is freely available on Innovation, Science and Economic Development Canada's website, using their Financial Performance Data (http://www.ic