CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

Yi Huang, Xiaoting Wu, Si Chen, Wei Hu, Qing Zhu, Junlan Feng, Chao Deng, Zhijian Ou, Jiangjiang Zhao


Abstract
Dialogue modeling problems severely limit the real-world deployment of neural conversational models, and building a human-like dialogue agent remains an extremely challenging task. Data-driven models, which need a huge amount of conversation data, have recently become more and more prevalent. In this paper, we release around 100,000 dialogues drawn from real-world transcripts of conversations between real users and customer-service staff. We call this the CMCC (China Mobile Customer Care) dataset; it differs significantly from existing dialogue datasets in both size and nature. The dataset reflects several characteristics of human-human conversations, e.g., they are task-driven and care-oriented and exhibit long-term dependencies across the context. It also covers various dialogue types, including task-oriented dialogue, chitchat, and conversational recommendation in real-world scenarios. To our knowledge, CMCC is the largest real human-human spoken dialogue dataset, dozens of times larger than others, which should significantly promote the training and evaluation of dialogue modeling methods. The results of extensive experiments indicate that CMCC is challenging and that further effort is needed. We hope that this resource will allow more effective models to be built across various dialogue sub-problems in the future.
Anthology ID:
2022.seretod-1.7
Volume:
Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD)
Month:
December
Year:
2022
Address:
Abu Dhabi, Beijing (Hybrid)
Editors:
Zhijian Ou, Junlan Feng, Juanzi Li
Venue:
SereTOD
Publisher:
Association for Computational Linguistics
Pages:
48–61
URL:
https://aclanthology.org/2022.seretod-1.7
DOI:
10.18653/v1/2022.seretod-1.7
Cite (ACL):
Yi Huang, Xiaoting Wu, Si Chen, Wei Hu, Qing Zhu, Junlan Feng, Chao Deng, Zhijian Ou, and Jiangjiang Zhao. 2022. CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems. In Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD), pages 48–61, Abu Dhabi, Beijing (Hybrid). Association for Computational Linguistics.
Cite (Informal):
CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems (Huang et al., SereTOD 2022)
PDF:
https://aclanthology.org/2022.seretod-1.7.pdf
Video:
https://aclanthology.org/2022.seretod-1.7.mp4