The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

Xing Han Lu, Siva Reddy, Harm de Vries


Abstract
We introduce the StatCan Dialogue Dataset consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.
Anthology ID:
2023.eacl-main.206
Volume:
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2799–2829
URL:
https://aclanthology.org/2023.eacl-main.206
DOI:
10.18653/v1/2023.eacl-main.206
Cite (ACL):
Xing Han Lu, Siva Reddy, and Harm de Vries. 2023. The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2799–2829, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents (Lu et al., EACL 2023)
PDF:
https://aclanthology.org/2023.eacl-main.206.pdf
Dataset:
 2023.eacl-main.206.dataset.pdf
Video:
 https://aclanthology.org/2023.eacl-main.206.mp4