Multi 3 WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

Songbo Hu; Han Zhou; Mete Hergul; Milan Gritta; Guchun Zhang; Ignacio Iacobacci; Ivan Vulić; Anna Korhonen

doi:10.1162/tacl_a_00609

Multi 3 WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

Songbo Hu, Han Zhou, Mete Hergul, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Ivan Vulić, Anna Korhonen

Abstract

Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to reduce all the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom–up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.

Anthology ID:: 2023.tacl-1.79
Volume:: Transactions of the Association for Computational Linguistics, Volume 11
Month:
Year:: 2023
Address:: Cambridge, MA
Venue:: TACL
SIG:
Publisher:: MIT Press
Note:
Pages:: 1396–1415
Language:
URL:: https://aclanthology.org/2023.tacl-1.79/
DOI:: 10.1162/tacl_a_00609
Bibkey:
Cite (ACL):: Songbo Hu, Han Zhou, Mete Hergul, Milan Gritta, Guchun Zhang, Ignacio Iacobacci, Ivan Vulić, and Anna Korhonen. 2023. Multi 3 WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems. Transactions of the Association for Computational Linguistics, 11:1396–1415.
Cite (Informal):: Multi 3 WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems (Hu et al., TACL 2023)
Copy Citation:
PDF:: https://aclanthology.org/2023.tacl-1.79.pdf
Video:: https://aclanthology.org/2023.tacl-1.79.mp4

PDF Cite Search Video Fix data