On the Need for Thoughtful Data Collection for Multi-Party Dialogue: A Survey of Available Corpora and Collection Methods

Khyati Mahajan, Samira Shaikh


Abstract
We present a comprehensive survey of available corpora for multi-party dialogue. We survey over 300 publications related to multi-party dialogue and catalogue all available corpora in a novel taxonomy. We analyze methods of data collection for multi-party dialogue corpora and identify several lacunae in existing data collection approaches used to collect such dialogue. We present this survey, the first survey to focus exclusively on multi-party dialogue corpora, to motivate research in this area. Through our discussion of existing data collection methods, we identify desiderata and guiding principles for multi-party data collection to contribute further towards advancing this area of dialogue research.
Anthology ID:
2021.sigdial-1.36
Volume:
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Month:
July
Year:
2021
Address:
Singapore and Online
Editors:
Haizhou Li, Gina-Anne Levow, Zhou Yu, Chitralekha Gupta, Berrak Sisman, Siqi Cai, David Vandyke, Nina Dethlefs, Yan Wu, Junyi Jessy Li
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
338–352
URL:
https://aclanthology.org/2021.sigdial-1.36
DOI:
10.18653/v1/2021.sigdial-1.36
Cite (ACL):
Khyati Mahajan and Samira Shaikh. 2021. On the Need for Thoughtful Data Collection for Multi-Party Dialogue: A Survey of Available Corpora and Collection Methods. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 338–352, Singapore and Online. Association for Computational Linguistics.
Cite (Informal):
On the Need for Thoughtful Data Collection for Multi-Party Dialogue: A Survey of Available Corpora and Collection Methods (Mahajan & Shaikh, SIGDIAL 2021)
PDF:
https://aclanthology.org/2021.sigdial-1.36.pdf
Video:
https://www.youtube.com/watch?v=1PJRwGVxMEs
Data
CRD3, Interview, MELD, Molweni, OpenSubtitles, Serial Speakers