Automatic Generation of Large-scale Multi-turn Dialogues from Reddit

Daniil Huryn, William M. Hutsell, Jinho D. Choi


Abstract
This paper presents novel methods to automatically convert posts and their comments from discussion forums such as Reddit into multi-turn dialogues. Our methods are generalizable to any forum; thus, they allow us to generate a massive number of dialogues on diverse topics that can be used to pretrain language models. Four methods are introduced, Greedy_Baseline, Greedy_Advanced, Beam Search, and Threading, which are applied to posts from 10 subreddits and assessed. Each method makes a noticeable improvement over its predecessor, such that the best method shows an improvement of 36.3% over the baseline for appropriateness. Our best method is applied to posts from those 10 subreddits to create a corpus comprising 10,098 dialogues (3.3M tokens), 570 of which are compared against dialogues in three other datasets: Blended Skill Talk, Daily Dialogue, and Topical Chat. Our dialogues are found to be more engaging but slightly less natural than those in the other datasets, while generating our corpus costs a fraction of the human labor and money required for the others. To the best of our knowledge, this is the first work to create a large multi-turn dialogue corpus from Reddit that can advance neural dialogue systems.
Anthology ID: 2022.coling-1.297
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 3360–3373
URL: https://aclanthology.org/2022.coling-1.297
Cite (ACL): Daniil Huryn, William M. Hutsell, and Jinho D. Choi. 2022. Automatic Generation of Large-scale Multi-turn Dialogues from Reddit. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3360–3373, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal): Automatic Generation of Large-scale Multi-turn Dialogues from Reddit (Huryn et al., COLING 2022)
PDF: https://aclanthology.org/2022.coling-1.297.pdf
Data: Blended Skill Talk, Wizard of Wikipedia