ForumSum: A Multi-Speaker Conversation Summarization Dataset

Abstractive summarization quality has improved substantially with recent language pretraining techniques. However, there is currently a lack of datasets to meet the growing needs of conversation summarization applications. We therefore collected ForumSum, a diverse and high-quality conversation summarization dataset with human-written summaries. The conversations in the ForumSum dataset are collected from a wide variety of internet forums. To make the dataset easily expandable, we also release the process of dataset creation. Our experiments show that models trained on ForumSum have better zero-shot and few-shot transferability to other datasets than the existing large chat summarization dataset SAMSum. We also show that using a conversational corpus for pre-training improves the quality of the chat summarization model.


Introduction
With the increasing volume of digital communication, there is a growing need to manage the exploding amount of information. One way of relieving users from information overload in chat applications is automatic abstractive summarization: selecting important pieces of information and writing them into accurate, fluent and concise summaries.
Recently there have been many advances in automatic abstractive summarization using large pretrained language models (Zhang et al., 2020; Lewis et al., 2020; Raffel et al., 2020) finetuned on downstream summarization datasets. However, most of the pre-training and finetuning domains are news documents (Narayan et al., 2018; See et al., 2017), and summarizing conversations has received comparatively little attention.
In this work, we aim to build a high-quality conversation summarization system that generalizes well by creating a new dataset and improving pretraining methods for conversation summarization. (The dataset is available at TensorFlow Datasets and Hugging Face.)
Our contributions include:
• We collected a diverse and high-quality conversational summarization dataset from 281 internet forums and release the dataset creation process to make it easily expandable.
• Our experiments show that models trained on ForumSum transfer better to new domains than models trained on the SAMSum dataset.
• We show that pre-training on a conversational corpus improves the quality of chat summarization models.

Related Work
SAMSum (Gliwa et al., 2019) is a corpus of 16k high-quality chat dialogues with abstractive summaries manually written by linguists. The linguists were asked to create informal, semi-formal and formal conversations similar to their daily messenger conversations, including chit-chat, gossiping about friends, arranging meetings, discussing politics, consulting university assignments with colleagues, etc. Despite its large size and excellent quality, the conversation styles are relatively homogeneous and 75% of the conversations are between two people, whereas summarizing conversations that involve many speakers is a more useful scenario in real-world applications.
Ubuntu/NYC (Bhatia et al., 2014) are online thread summarization datasets containing 100 threads from ubuntuforums.org and tripadvisor.com, respectively, along with human-written summaries.
BC3 (Ulrich et al., 2008) consists of 40 email threads each annotated with three summaries by three different annotators. Each summary sentence is also annotated with references to the corresponding lines in the emails.
We employ the BC3, Ubuntu and NYC datasets to evaluate the transferability of models trained on the ForumSum and SAMSum datasets.
MediaSum, SummScreen MediaSum (Zhu et al., 2021) and SummScreen (Chen et al., 2021) are conversation summarization datasets that use speech transcripts as input and automatically mine summaries from interview overviews and TV show recaps. While very large and diverse, their automatically mined summaries may suffer from lower quality.
Meeting transcripts are much longer than messenger or email threads. For example, an average AMI transcript contains 289 turns, while the average number of turns in the ForumSum dataset is around 10. Meeting transcripts also contain more repetitions, backchannel responses and interjections.
In this paper we focus on summarizing online messaging conversations. We therefore do not use MediaSum, SummScreen, AMI or ICSI in our transferability studies.

ForumSum Dataset
Motivated by the lack of diverse multi-speaker conversation summarization datasets, we collected ForumSum: a conversation summarization dataset from internet forums labeled with human-written summaries.
First, we collected a list of message board websites. We kept only websites that are relatively popular and are not present in a blocklist of inappropriate websites. The resulting list contains 281 websites.

Conversation Selection
We scraped all posts with comments from the forums and combined topic starters and the corresponding sequences of comments into conversations.
To get a cleaner and more diverse set of conversations we applied the following filters:
• Filtered out conversations that contained scraping artifacts such as XML tags.
• Filtered out conversations that contain any offensive word from a list of English obscene words and collocations. 2
• Sampled at most 200 conversations per website to smooth the distribution over websites.
• Filtered out conversations with only a single speaker.
• Filtered out short conversations that have fewer than 4 turns.
Conversations that passed this set of filters were sent to be annotated with summaries.
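The filtering steps above can be sketched as a simple predicate. This is a minimal illustration, not the released pipeline: the obscene-word list and the exact artifact detection are assumptions, and the 200-per-website cap would be applied separately at the corpus level.

```python
import re

# Hypothetical stand-in for the paper's list of English obscene words.
OBSCENE_WORDS = {"badword"}

def passes_filters(conversation):
    """Return True if a conversation survives the Section 3.1 filters.

    A conversation is a list of (speaker, text) turns.
    """
    # Filter out conversations with scraping artifacts such as XML tags.
    if any(re.search(r"<[^>]+>", text) for _, text in conversation):
        return False
    # Filter out conversations containing any offensive word.
    words = {w.lower() for _, text in conversation
             for w in re.findall(r"\w+", text)}
    if words & OBSCENE_WORDS:
        return False
    # Filter out single-speaker conversations.
    if len({speaker for speaker, _ in conversation}) < 2:
        return False
    # Filter out short conversations with fewer than 4 turns.
    if len(conversation) < 4:
        return False
    return True
```

For example, a four-turn, two-speaker conversation with clean text passes, while a two-turn thread or a thread with leftover XML tags is dropped.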

Crowd-source Annotation
We used Amazon Mechanical Turk to annotate the conversations with human-written summaries.
To safeguard summary quality, we split the dataset into batches of 100-200 examples and sent the conversations to annotators batch by batch. After acquiring the results of each batch, we manually assessed summary quality on a scale from 1 to 5, identified common issues and made changes to the instructions.
After 4 such iterations we stopped making changes to the instructions. However, we kept evaluating samples of summaries in each batch to ensure that the quality of the summaries did not drop. All batches annotated after the instructions were finalized received uniformly good average scores between 4.3 and 4.7, and only these batches are included in the ForumSum dataset.
See Appendix B for the full final instructions.

ForumSum Style and Format
ForumSum conversations are formatted similarly to conversations in the SAMSum dataset: each utterance starts on a new line and contains an author name and a message text separated by a colon. ForumSum summaries also have a third-person style similar to SAMSum summaries, but are longer and more descriptive. See Appendix A for examples of ForumSum conversations and summaries.
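The input format can be reproduced in a few lines (a sketch; the function name is ours, not from the released code):

```python
def format_conversation(turns):
    """Render a conversation in the ForumSum/SAMSum input format:
    each utterance on its own line, author and message separated by a colon."""
    return "\n".join(f"{author}: {message}" for author, message in turns)

formatted = format_conversation([
    ("Ann", "Does anyone have a spare charger?"),
    ("Bob", "I do, come by my desk."),
])
# formatted == "Ann: Does anyone have a spare charger?\nBob: I do, come by my desk."
```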

Statistics
Table 1 and Figure 1 show that the ForumSum dataset contains significantly more multi-speaker threads and longer utterances than SAMSum, and that their distributions are more spread out. ForumSum summaries are also longer than SAMSum summaries on average. In addition, ForumSum conversations have a richer vocabulary: they contain 66% more unique words for approximately the same number of total words.
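The unique-versus-total word comparison behind the vocabulary-richness claim can be computed as below. The exact tokenization the authors used is not specified; lowercased `\w+` tokens are our assumption.

```python
import re
from collections import Counter

def vocab_stats(texts):
    """Return (total_tokens, unique_tokens) over a list of conversation texts."""
    counts = Counter(w for t in texts for w in re.findall(r"\w+", t.lower()))
    return sum(counts.values()), len(counts)

total, unique = vocab_stats(["the cat sat", "the dog ran"])
# 6 total tokens, 5 unique types ("the" repeats)
```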

Experiments and Results
Following Zhang et al. (2020), we conduct our studies using transformer encoder-decoder models at the BASE size, with L = 12, H = 768, F = 3072 and A = 12, where L denotes the number of layers in the encoder and decoder (i.e., Transformer blocks), H the hidden size, F the feed-forward layer size and A the number of self-attention heads.
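A minimal sketch of this BASE configuration (the dictionary keys are illustrative names, not from the authors' code):

```python
# Encoder-decoder BASE size: L Transformer blocks in the encoder and in the
# decoder, hidden size H, feed-forward size F, and A self-attention heads.
base_config = {
    "num_layers": 12,    # L
    "hidden_size": 768,  # H
    "ffn_size": 3072,    # F
    "num_heads": 12,     # A
}

# The hidden size must divide evenly across heads: 768 / 12 = 64 per head.
assert base_config["hidden_size"] % base_config["num_heads"] == 0
```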

Zero-shot and Few-shot Transferability
We finetune a pretrained Pegasus-Base (Zhang et al., 2020) model on the ForumSum and SAMSum datasets respectively and then study their transferability to out-of-domain chat summarization datasets. We chose to evaluate on Ubuntu/NYC/BC3 because they are very small and contain online-messaging conversations. None of them overlaps with the SAMSum/ForumSum datasets. In the zero-shot setting, we directly evaluate the models' performance, and in the few-shot setting we finetune on all training examples in those small datasets.
For the BC3 dataset, we treat each original summary sentence as an independent summary and construct synthetic conversations using the annotated references to email lines. We then format all input data in Ubuntu/NYC/BC3 consistently with SAMSum/ForumSum as described in Section 3.3. Table 2 shows that finetuning Pegasus first on either SAMSum or ForumSum and then on the smaller datasets improves model performance. Furthermore, ForumSum models transfer to Ubuntu/NYC/BC3 better than SAMSum models in both zero-shot and few-shot settings. This suggests that the variety of the conversation distribution in ForumSum helps generalization to out-of-domain datasets. More experimental details can be found in Appendix D.
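The BC3 preprocessing described above can be sketched as follows. The data layout and field names here are our assumptions for illustration; BC3's actual annotation format (XML with sentence-level references) differs.

```python
def build_bc3_examples(email_lines, summary_sentences):
    """Treat each BC3 summary sentence as an independent summary and build a
    synthetic conversation from the email lines it references.

    email_lines: dict mapping a line reference to an (author, text) pair.
    summary_sentences: list of (sentence, [line references]) pairs.
    Returns (conversation_text, summary) pairs in the "Author: message"
    format used for SAMSum/ForumSum inputs.
    """
    examples = []
    for sentence, refs in summary_sentences:
        turns = [email_lines[r] for r in refs if r in email_lines]
        conversation = "\n".join(f"{a}: {t}" for a, t in turns)
        examples.append((conversation, sentence))
    return examples

emails = {
    ("e1", 1): ("Ann", "Can we move the meeting?"),
    ("e2", 1): ("Bob", "Tuesday works for me."),
}
sents = [("Ann and Bob reschedule the meeting.", [("e1", 1), ("e2", 1)])]
examples = build_bc3_examples(emails, sents)
```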

Human Evaluation
We conducted a side-by-side human evaluation comparing the predictions from Pegasus+SAMSum and Pegasus+ForumSum on the test sets of SAMSum, ForumSum, Ubuntu and NYC. Trained human raters, given a chat thread and two summaries in randomized order, were asked to compare the summaries across seven categories. More details can be found in Appendix E. Table 3 shows the distribution of the raters' preferences on all downstream domains. The SAMSum and ForumSum models each perform better when evaluated on the domain they were trained on, and ForumSum models generalize better to other domains such as Ubuntu. These findings align with the ROUGE scores in Section 4.1.

Dataset Expansion
Can further dataset expansion potentially improve the quality of our models? To answer this question, we evaluated models trained on different numbers of examples randomly chosen from the training dataset. By extrapolating the relation between quality and the number of training examples, we can predict whether adding more data from the same distribution would lead to quality improvements. As shown in Figure 2, both datasets benefit from more training examples as ROUGE scores go up, suggesting that further expansion of both the SAMSum and ForumSum datasets would further improve summary quality. More details are in Appendix D.
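One simple way to extrapolate quality against data size is a least-squares fit of the score against log(number of examples). This is a toy model of our own; the paper does not specify how it extrapolates.

```python
import math

def fit_log_curve(ns, scores):
    """Least-squares fit of score ≈ a + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    mx, my = sum(xs) / len(xs), sum(scores) / len(scores)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, scores))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical ROUGE scores at three training-set sizes.
a, b = fit_log_curve([100, 1000, 10000], [20.0, 25.0, 30.0])
predicted = a + b * math.log(100000)  # extrapolate to 10x more data
```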

Effect of Pretraining Corpus
We studied whether pretraining on conversational data helps conversation summarization models.
We collected a Forums corpus in a similar way to how we collected the source data for the ForumSum dataset. To make the pre-training corpus as large as possible, we used 56,569 forums and did not apply any of the filters described in Section 3.1, but we removed all examples included in the ForumSum dataset from the corpus. The pre-training corpus contains around 516M conversations.
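Removing ForumSum examples from the pre-training corpus can be sketched as exact-duplicate filtering over normalized text. The normalization and hashing scheme here are our assumptions; the authors do not describe their deduplication method.

```python
import hashlib

def content_key(conversation_text):
    """Stable fingerprint of a conversation after whitespace/case normalization."""
    normalized = " ".join(conversation_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def remove_overlap(corpus, held_out_examples):
    """Drop any pre-training conversation whose text also appears in the
    held-out (ForumSum) set."""
    held_out_keys = {content_key(c) for c in held_out_examples}
    return [c for c in corpus if content_key(c) not in held_out_keys]
```

Hashing keeps memory proportional to the held-out set rather than the 516M-conversation corpus, so the filter can be streamed over the scraped data.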
We pretrained a Pegasus-Base model on the forum corpus for 600K steps and finetuned it on the SAMSum and ForumSum datasets. As shown in Table 4, pretraining on conversational corpora improves conversation summarization models' performance. See Appendix D for details and experiment hyper-parameters.

Conclusions
We collected ForumSum, a diverse and high-quality chat summarization dataset with human-written summaries. ForumSum can be easily expanded to further improve conversation summarization quality using the released dataset creation process. Our experiments show that models trained on ForumSum have good zero-shot and few-shot transferability to other conversation summarization datasets, as measured by ROUGE scores and human evaluations. We also show that using a conversational corpus for pre-training improves the quality of the conversation summarization model.

A Sample Data
Table 5 contains examples of conversations and summaries from the ForumSum dataset.
Ttechhunter: Im shooting a carbon element and I have the QAD Hoyt rest on it. I cannot adjust the windage of the rest without taking the rest off. Anyone have this problem? They make an Allen wrench that will fit between the riser and the rest? Thanks
splitbeam145: they do have a short allen wrench that will fit it. Most shops will have several laying around.

B MTurk Template
Here are the instructions that were shown to the MTurk workers.

Write a summary of the conversation
Read the conversation and write a short summary.
• Be concise. Only cover main ideas and topics. Don't recite every message in the conversation. Try to fit the summary into 1-3 short sentences unless the conversation is long and there're multiple subjects discussed.
• Be specific. The summary must contain the main outcome of the conversation, not just the topic.
- Good: "Jack can't install Windows 7 because of a broken license. Ann provided a working security key and Bob gave instructions for the update."
- Bad: "Several users provide Jack with help troubleshooting his computer issues."
• Use third-person form e.g.
- Good: "Ann likes oranges"
- Bad: "I like oranges"
• Prefer usernames instead of common words like "user" and "people". Spell usernames as they are spelled in the conversation.
- Good: "Ann asked"
- Bad: "A user asked"
• Avoid words that don't add meaning
- Good: "Ann and John discuss..."
- Bad: "This seems to be a conversation where people discuss"
• Be objective. Avoid judgemental comments.
- Good: "Ann and John make jokes about"
- Bad: "Ann and John make stupid jokes about"
• The summary must be grammatically correct. Start sentences with a capital letter and use punctuation marks.
• The conversation might contain some unknown terminology. That's okay. Try figuring out what the conversation is about or google the words you don't know.

C Forums statistics
See Table 6 for the most frequent message board websites used to build the ForumSum dataset. The full list of websites with counts is available at https://pastebin.com/w6wUDQx3.

D Experiment Hyper-parameters
See Table 7 for all hyperparameters we used in the experiments.

D.1 Transfer Study
For the Ubuntu and NYC datasets we select 50 random examples for the validation set and leave the other 50 in the training set. For BC3 we select 13 email threads for the validation set and use the other 27 email threads as the training set. We report validation numbers for all models.
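The splits can be reproduced with a seeded random holdout (a sketch; the authors' actual seed and sampling procedure are unknown):

```python
import random

def split_examples(examples, n_val, seed=0):
    """Randomly hold out n_val examples for validation; the rest train.
    The paper uses 50/50 for Ubuntu and NYC and 13/27 threads for BC3."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]

train, val = split_examples(list(range(100)), n_val=50, seed=1)
```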

D.2 Pretraining Study
For baseline pretraining experiments we used a mixture of C4 and HugeNews datasets. See (Zhang et al., 2020) for more details about these datasets.

Parameter          Value
Total parameters   223M
Learning rate      1e-4
Dropout rate       0.1
Label smoothing    0.1
Batch size         1024
Max input tokens   512
Max target tokens  128
Beam size          5
Beam alpha         0.8