ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues–viewpoints–assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and filter noisy input, showing comparable or improved results according to automatic and human evaluations.


Introduction
Automatic text summarization is the process of outputting the most salient parts of an input in a concise and readable form. Recent work in summarization has made significant progress due to the introduction of large-scale datasets such as the CNN-DailyMail dataset (Nallapati et al., 2016) and the New York Times dataset (Sandhaus, 2008). Furthermore, the use of large self-supervised pretrained models such as BART (Lewis et al., 2020) and Pegasus (Zhang et al., 2019) has achieved state-of-the-art performance across summarization tasks and strong performance in zero- and few-shot settings (Fabbri et al., 2020a). However, less work has focused on summarizing online conversations.
Table 1: Example summary of comments from a New York Times article discussing people's favorite parts of the Super Bowl. The summary is an analysis of the comments and quantifies the viewpoints present.
Example summary: Several commenters list their favorite things about the Super Bowl, including half-time shows, the funny commercials, the Puppy Bowl, eating food, and spending time with family. A couple of commenters admit to not being football fans but still enjoying the Super Bowl. Some commenters discuss whether they thought the Falcons or the Patriots were going to win, while others list teams they wish were in the game.
Unlike documents, articles, and scientific papers, which contain specific linguistic structures and conventions such as topic sentences and abstracts, conversational text scatters main points across multiple utterances and between numerous writers. As a result, the text summarization task in the conversational data domain offers a challenging research field to test newly-developed models (Chen and Yang, 2020).
Recently, Gliwa et al. (2019a) introduced a dataset for chat-dialogue conversation summarization consisting of 16k examples, the first large-scale dataset of its kind. Previous work in conversation summarization was limited by the data available and focused primarily on meeting summarization, such as the AMI (Kraaij et al., 2005) and ICSI (Janin et al., 2003) datasets. The datasets used in recent conversation papers are often not uniform, ranging from visual dialogue data (Goo and Chen, 2018a) to customer-service dialogues (Yuan and Yu, 2019), and were not initially intended for summarization. The limited availability of benchmark datasets for comparing methods has restricted work in other conversation summarization domains and thus likely inhibited progress (Kryscinski et al., 2019; Fabbri et al., 2020b).
We aim to address this research gap by crowdsourcing a suite of four datasets, which we call ConvoSumm, that can evaluate a model's performance on a broad spectrum of conversation data. In determining the domains of data to collect, we use the general definition of conversation as "any discourse produced by more than one person" (Ford, 1991). We identify several key categories of data for which standard human-created development and testing datasets do not exist, namely (1) news article comments, (2) discussion forums and debate, (3) community question answering, and (4) email threads. We design annotation protocols motivated by work in quantifying viewpoints present in news comment data (Barker and Gaizauskas, 2016a) to crowdsource 250 development and 250 test examples for each of the above domains. We provide an example of comments to a New York Times news article, and our crowdsourced summary in Table 1.
In addition to introducing manually-curated datasets for conversation summarization, we also aim to unify previous work in conversation summarization. Namely, we benchmark a state-of-the-art abstractive model on several conversation datasets: dialogue summarization from SAMSum (Gliwa et al., 2019b), heuristic-generated community question answering from CQASumm (Chowdhury and Chakraborty, 2018), meeting summarization data from AMI and ICSI, and smaller test sets in the news comments, discussion forum, and email domains. We believe that such benchmarking will facilitate a more straightforward comparison of conversation summarization models across domains.
To unify modeling across these conversational domains, we propose to use recent work in end-to-end argument mining (Lenz et al., 2020; Stab and Gurevych, 2014; Chakrabarty et al., 2019) to instantiate the theoretical graph framework which motivated our annotation protocol, proposed by Barker and Gaizauskas (2016a) for conversation summarization. This protocol is employed to both identify and use the "issues-viewpoints-assertions" argument structure (discussed in Related Work) for summarizing news comments. We construct this argument graph using entailment relations, linearize the graph, train a graph-to-text model (Ribeiro et al., 2020), and experiment with argument mining as a way to reduce noise in long-text input.
Our contributions are the following: (1) we crowdsource datasets for four domains of conversational data and analyze the characteristics of our proposed datasets; (2) we benchmark state-of-the-art models on these datasets as well as previous widely-used conversation summarization datasets to provide a clear baseline for future work; and (3) we apply argument mining to better model the structure of our conversational data and to reduce noise in long-text input, showing comparable or improved results in both automatic and human evaluations.

Related Work
Modeling Conversation Summarization Early approaches to conversation summarization consisted of feature engineering (Xie et al., 2008), template selection methods (Oya et al., 2014), and statistical machine learning approaches (Galley, 2006; Wang and Cardie, 2013). More recent modeling approaches for dialogue summarization have attempted to take advantage of conversation structures found within the data through dialogue act classification (Goo and Chen, 2018b), discourse labeling (Ganesh and Dingliwal, 2019), topic segmentation (Liu et al., 2019c), and keypoint analysis (Liu et al., 2019a). Chen and Yang (2020) utilize multiple conversational structures from different perspectives in their sequence-to-sequence model. However, such approaches focus exclusively on dialogue summarization, and it is not trivial to extend such methods to longer conversations with many more participants. We thus introduce a method to model the structure of the discourse over the many-party conversation.
Several existing works have focused on conceptualizing conversation structure for summarization and how to present this structure to end users. Barker et al. (2016a) propose a conversation overview summary that aims to capture the key argumentative content of a reader comment conversation. Misra et al. (2017) use summarization as a means of probing online debates to discover central propositions, which they cluster to identify argument facets. Barker and Gaizauskas (2016b) identify three key components of conversational dialogue: issues (that individuals discuss), viewpoints (that they hold about these issues), and assertions (that they make to support their viewpoints). We build on this framework and advances in argument mining for end-to-end training for summarization.
Argument Mining Work in argument mining (Stab and Gurevych, 2014) has aimed to identify these argumentative units and classify them into claims, premises, and major claims, or claims describing the key concept in a text. More recently, Chakrabarty et al. (2019) propose to finetune BERT (Devlin et al., 2019) for identifying argumentative units and relationships between them within a text and across texts. Lenz et al. (2020) are the first to propose an end-to-end approach for constructing an argument graph (Stede et al., 2016), a structured representation of claims and premises in an argumentative text; the graph is built by connecting claim and premise argumentative discourse units. We build on this framework for modeling discourse in conversational data.
Few-Shot Summarization As the datasets we introduce are not on a scale with larger datasets, we focus on few-shot and domain transfer summarization techniques. Prior work examines domain adaptation in extractive summarization, while Hua and Wang (2017) examine domain adaptation between opinion and news summarization. Within unsupervised abstractive summarization, several approaches have made use of variational autoencoders (Baziotis et al., 2019; Chu and Liu, 2019; Bražinskas et al., 2020) and pretrained language models (Zhou and Rush, 2019; Laban et al., 2020).
Recent work in abstractive (Zhang et al., 2019; Fabbri et al., 2020a) and extractive-compressive summarization (Desai et al., 2020) has shown the power of pretrained models for few-shot transfer. The quality of models trained on several hundred examples in these papers is comparable to that of models trained on the equivalent full datasets. Thus, we believe that introducing curated validation and testing datasets consisting of a few hundred examples is a valuable contribution within the current paradigm. This was confirmed by the poor performance of models transferred from other domains compared to models trained on this validation data.

ConvoSumm
In this section, we introduce our dataset selection, our annotation protocol, and the characteristics of our crowdsourced dataset.
Data Selection For the news comments subdomain, we use the NYT Comments dataset, which consists of 2 million comments made on 9,000 New York Times articles published between 2017 and 2018. It is publicly available and has been used in work for news-comment relevance modeling (Kolhatkar and Taboada, 2017); it also contains metadata that may be of use in summarization modeling. For the discussion forums and debate subdomain, we select Reddit data from CoarseDiscourse (Zhang et al., 2017), which contains annotations about the discourse structure of the threads. For the community question answering subdomain, we use StackExchange (Stack), which provides access to all forums and has been used in modeling for answer relevance and question deduplication (Hoogeveen et al., 2015). We chose StackExchange over the commonly-used Yahoo! Answers data due to licensing reasons. For the email threads subdomain, we use the publicly-available W3C corpus (Craswell et al., 2005). Previous work also made use of this dataset for email summarization (Ulrich et al., 2008) but provided only a small sample of 40 email threads, for which we provide transfer testing results.
We generally follow the guidance of Tomasoni and Huang (2010), from summarizing community question answering forums, for determining which subsets of data to select from the above datasets. We remove an example if (1) there were fewer than five posts (four in the case of email threads; "post" refers to any answer, comment, or email); (2) the longest post was over 400 words; (3) the sum of all post lengths was outside of the [100, 1400] words interval (although we extended this maximum length for NYT comments); or (4) the average length of the posts was outside of the [50, 300] words interval. For Stack data, we first filtered out answers which received a negative community rating, as defined by the number of user upvotes minus the number of user downvotes. While real-world settings may contain much longer threads, we later show that this setting is already challenging.
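The four removal criteria above can be sketched as a single predicate. This is a minimal illustration rather than the authors' filtering script: the function name `keep_thread`, its parameter names, and the use of whitespace tokenization to count words are all assumptions.

```python
def keep_thread(posts, min_posts=5, max_post_len=400,
                total_range=(100, 1400), avg_range=(50, 300)):
    """Return True if a thread survives the four length-based filters.

    posts: list of strings, one per answer, comment, or email.
    Words are counted by whitespace splitting (an assumption).
    """
    lengths = [len(post.split()) for post in posts]
    # (1) too few posts (pass min_posts=4 for email threads)
    if len(lengths) < min_posts:
        return False
    # (2) longest post over the per-post word limit
    if max(lengths) > max_post_len:
        return False
    # (3) total length outside the allowed interval
    total = sum(lengths)
    if not total_range[0] <= total <= total_range[1]:
        return False
    # (4) average post length outside the allowed interval
    average = total / len(lengths)
    return avg_range[0] <= average <= avg_range[1]
```

For email threads one would call `keep_thread(posts, min_posts=4)`, and for NYT comments a larger upper bound could be passed through `total_range`; the extended maximum itself is not stated in the text.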
Annotation Protocol We designed annotation instructions for crowdsourced workers to write abstractive summaries for each of the four datasets, motivated by work in summarizing viewpoints present in online conversation (Barker and Gaizauskas, 2016a). We present the crowdsource workers with the data threads, along with any available metadata. For NYT, we presented the workers with the article headline, keywords, and, rather than providing the entire article as context, an extractive BERT-based summary (Miller, 2019) of the article. We use a BERT summary to give the annotators an idea of the topic of the article. We avoided having annotators read the entire article since the focus of their summaries was solely the content of the comments as per the annotation protocols, and reading the entire article could end up introducing information in the summaries that was not necessarily representative of the comments' main points. We found that these summaries were useful in initial in-house annotations, and allowed us to better understand the context of the comments being summarized. For Reddit and Stack, question tags and information about the subforum were provided; the Stack data includes both answers and answer comments. Reddit data was filtered simply on word limits due to the unavailability of up/down votes from the Coarse Discourse data. Stack data includes the prompt/title as well. Whenever possible, we included username information and the scores of all comments, posts, and answers. Although the instructions differed slightly with the specific nuances of each dataset, they had standard overall rules: (1) summaries should be an analysis of the given input rather than another response or utterance; (2) summaries should be abstractive, i.e., annotators were required to paraphrase and could not repeat more than five words in a row from the source; and (3) summaries should be between 40 and 90 tokens long.
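Rules (2) and (3) are mechanical enough to check automatically. The sketch below is not part of the described annotation pipeline; the function names, lowercasing, and whitespace tokenization are assumptions made for illustration. It finds the longest run of consecutive tokens a summary shares with its source and checks the length bounds.

```python
def longest_copied_run(summary_tokens, source_tokens):
    """Length of the longest contiguous token run found in both sequences."""
    best = 0
    prev = [0] * (len(source_tokens) + 1)
    for summ_tok in summary_tokens:
        curr = [0] * (len(source_tokens) + 1)
        for j, src_tok in enumerate(source_tokens, start=1):
            if summ_tok == src_tok:
                # extend the run that ended at the previous token pair
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def check_summary(summary, source, min_len=40, max_len=90, max_copy=5):
    """Check the length rule and the no-long-copy rule for one summary."""
    summ = summary.lower().split()
    src = source.lower().split()
    return {
        "length_ok": min_len <= len(summ) <= max_len,
        "abstractive_ok": longest_copied_run(summ, src) <= max_copy,
    }
```

The dynamic program is the standard longest-common-substring recurrence over tokens, so it runs in time proportional to the product of the two lengths, which is acceptable for summaries of under 100 tokens.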
Following the issues-viewpoints-assertions framework presented in Barker and Gaizauskas (2016b), we also instructed annotators that summaries should summarize all viewpoints in the input and should try to include specific details from assertions and anecdotes (unless this made the summary too lengthy). Summarizing based on similar viewpoints is analogous to clustering then summarizing, similar to the comment label grouping procedure before summarization in Barker et al. (2016b). To help with this, we recommended wording such as "Most commenters suggest that..." and "Some commenters think that..." to group responses with similar viewpoints.
However, the email dataset was unique among the selected datasets given that it contained more back-and-forth dialogue than clusters of viewpoints, and thus identifying the speakers was essential to creating summaries that still retained meaning from the original email dialogue. Since the email threads contained fewer individual speakers than the other datasets, this sort of summarization remained feasible. Thus, for this dataset, annotators were instructed to specify the speakers when summarizing the conversation.

Quality-Controlled Crowdsourcing
We crowdsourced our data using Amazon Mechanical Turk. We required that our workers be native English speakers and pass a qualifying exam for each domain to be summarized. We worked with a select group of about 15 workers who formed a community of high-quality annotators. Example summaries were provided to the workers. The workers submitted the qualifying exam, and then one of the authors of this paper provided feedback. If the worker was not sure of the quality of the summaries written, at any point, they could enlist the input of one of the authors.
Additionally, after the workers wrote all summaries, we manually reviewed every summary and made corrections to grammar, wording, and overall structure. Summaries we could not fix ourselves, either because they were poorly written or did not follow the annotation protocols, were flagged to be re-written. They were then sent to our approved group of workers to be re-written, excluding any workers who had written a flagged summary. While data crowdsourced from non-experts may contain noise (Gillick and Liu, 2010), we believe that our setup of working closely with a small group of workers, providing feedback to individual workers, and manually reviewing all final summaries mitigates these issues.

Dataset Statistics
We provide statistics in Table 2. The percentage of novel n-grams in our summaries is higher than that of the very abstractive XSum dataset (Narayan et al., 2018) (35.76/83.45/95.50% novel uni/bi/tri-grams). This level of abstraction is likely due to the instructions to perform abstractive summarization and the summaries being an analysis of the input, which results in the insertion of new words (e.g., "commenters" likely isn't seen in the input). The influence of this abstraction is further seen by an analysis of the Extractive Oracle, for which we show ROUGE-1/2/L (Lin, 2004). We see that the Extractive Oracle performance on our datasets is above that on the very abstractive XSum (Narayan et al., 2018) (29.79 ROUGE-1), but much lower than that on the CNN-DailyMail (CNNDM) dataset (Nallapati et al., 2016) (>50 ROUGE-1). The summary lengths are fairly consistent, while the input lengths are the longest for NYT and Stack data. We include the title and additional metadata such as the headline and snippet in NYT data in input length calculations.
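For readers unfamiliar with the novel n-gram statistic, a minimal sketch of one common way to compute it follows. The exact tokenization and counting used for Table 2 are not specified here, so lowercasing, whitespace splitting, and counting every n-gram occurrence in the summary are assumptions, as are the function names.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_pct(summary, source, n):
    """Percentage of summary n-grams that never occur in the source."""
    summary_grams = ngrams(summary.lower().split(), n)
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source.lower().split(), n))
    novel = sum(1 for gram in summary_grams if gram not in source_grams)
    return 100.0 * novel / len(summary_grams)
```

With the source "the cat sat on the mat" and the summary "the cat slept", one of three unigrams, one of two bigrams, and the single trigram are novel.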
We analyze multi-document summarization-specific characteristics of our datasets, as proposed by Dey et al. (2020a). In particular, inter-document similarity measures the degree of overlap of semantic units in the candidate documents, with scores further from zero signifying less overlap. The notion introduced for redundancy measures the overall distribution of semantic units; the farther the score is from zero, the more uniform semantic units are across the entire input, with the maximum when each unit is present only once. Layout bias measures the similarity of multi-sentential documents with the reference. For more precise definitions, we refer the reader to Dey et al. (2020a). We provide results for our data in Table 3. Email data exhibits the most inter-document similarity, which follows the intuition that an email thread consists of a focused discussion typically on a single topic. For redundancy, we see Reddit shows the most uniform distribution of semantic units, perhaps due to Reddit threads' less focused nature compared to the remaining datasets. We do not see a particularly strong layout bias across any parts of the input documents. Our datasets exhibit greater or comparable levels of novel n-grams compared to multi-document summarization datasets such as MultiNews (Fabbri et al., 2019) and CQASUMM (Chowdhury and Chakraborty, 2018). Our Stack subset has lower inter-document similarity, which presents challenges for models which rely strictly on redundancy in the input, and our datasets generally exhibit less layout bias, when compared to the analysis done in Dey et al. (2020b).

Comparison to Existing Datasets
Although previous work on conversation summarization, before the introduction of SAMSum (Gliwa et al., 2019b), has largely featured unsupervised or few-shot methods, there exist several datasets with reference summaries. These include SENSEI (Barker et al., 2016b) for news comments, the Argumentative Dialogue Summary Corpus (ADS) (Misra et al., 2015) for discussion forums, and the BC3 (Ulrich et al., 2009) dataset for email data. However, many of the existing datasets are not wide in scope. For example, SENSEI only covers six topics and the ADS Corpus covers one topic and only has 45 dialogues. Furthermore, they each pertain to one subdomain of conversation. Our dataset avoids these issues by covering four diverse subdomains of conversation and having approximately 500 annotated summaries for each subdomain. Additionally, since neural abstractive summarization baselines do not exist for these datasets, we benchmark our models on these datasets to further their use as test sets. We similarly include the AMI and ICSI meeting datasets within our benchmark.
Within community question answering, the WikiHowQA dataset (Deng et al., 2020) consists of user response threads to non-factoid questions starting with "how to," including labels for the answer selection task and reference summaries. The CQASUMM dataset (Chowdhury and Chakraborty, 2018) sampled threads from Yahoo! Answers in which the best answer could be used as a reference summary. However, this heuristic is not guaranteed to cover all the user answers' perspectives, so we believe our dataset is a more principled benchmark for community question answering.
We also note that several large-scale multi-document summarization (MDS) datasets have been introduced in the news domain (Fabbri et al., 2019; Gu et al., 2020; Gholipour Ghalandari et al., 2020), for creating Wikipedia lead paragraphs (Liu et al., 2018), and for long-form question answering (Fan et al., 2019). However, these do not focus on the conversational domain.

Argument Graph Summarization
As our annotation protocol is motivated by the issues-viewpoints-assertions framework proposed in Barker and Gaizauskas (2016a), we propose to instantiate a modified version of the theoretical graph model proposed in that work.

Argument Graph Construction
We build on the argument graph formulation of Lenz et al. (2020), a variant of the Argument Interchange Format (Chesnevar et al., 2006). Claims and premises are represented as information nodes (I-nodes), with the relations between them represented as scheme nodes (S-nodes). Let V = I ∪ S be the set of nodes, and E ⊂ V × V the set of edges describing support relationships among the nodes. We then define the argument graph G = (V, E). Lenz et al. (2020) break the construction of the argument graph down into four steps: (1) argument extraction, or the identification of argumentative discourse units; (2) relationship type classification, or the classification of edges between nodes; (3) major claim detection; and (4) graph construction, or the construction of the final graph based on the identified nodes and edges. To adapt this formulation to our multi-document setting, we first perform argument extraction and relationship type classification for each individual input document and finally graph construction to determine relationships among claims from all documents.
Argument Extraction For extracting arguments from a single document, we build on work in argument mining with pretrained models (Chakrabarty et al., 2019). As in Lenz et al. (2020), our argumentative units are sentences, from which we identify claims, which are assertions that something is true, and premises, which are propositions from which a conclusion is drawn. Additionally, we identify and remove non-argumentative units. We train a three-way classifier for the task of argument extraction, following Chakrabarty et al. (2019) and making use of data for argument mining from that paper and from Stab and Gurevych (2014). The output of this step can also simply be used without further graph construction as a less noisy version of the input, which we call -arg-filtered.
Relationship Type Classification We follow the procedure in Lenz et al. (2020) and use entailment to determine the relationship between argumentative units within a document. However, rather than using the classifier provided, we make use of RoBERTa (Liu et al., 2019b) fine-tuned on the MNLI entailment dataset (Williams et al., 2018). Rather than using both support and contradiction edges between claims and premises, we make the simplification that all relationships can be captured with support edges, as we are dealing with a single document in this step. Within a single text, each premise can be tied to one of the claims as following from it. We create an edge between any premise and the claim it most entails if the entailment score from RoBERTa is greater than 0.33, a threshold chosen based on manual analysis of the scores. If a premise is not labeled as supporting a claim, then we heuristically create an edge between that premise and the closest claim preceding it in the text.
Since some texts in the benchmark datasets may not be argumentative or may be too short to contain major claims, we use some heuristics in our graph creation. If none of the argumentative sentences are labeled as claims (i.e., all are labeled as premises) in argument extraction, the text's first sentence is labeled as the claim. Furthermore, we do not identify a single claim as the major claim since there may be multiple major points of discussion.
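The within-document edge rules and fallback heuristics can be sketched as follows. This is an illustrative reading, not the authors' code: `attach_premises` and its arguments are hypothetical names, and `entail_score` stands in for any callable returning an entailment probability (for instance, one backed by an MNLI-fine-tuned RoBERTa model). Leaving a premise unattached when no claim precedes it is also an assumption, since the text does not cover that case.

```python
def attach_premises(sentences, labels, entail_score, threshold=0.33):
    """Return (premise_index, claim_index) support edges for one document.

    sentences: argumentative sentences in document order.
    labels: parallel list with values "claim" or "premise".
    entail_score(premise, claim) -> float, a hypothetical scoring callable.
    """
    labels = list(labels)
    if not labels:
        return []
    # Heuristic: if nothing was labeled a claim, promote the first sentence.
    if "claim" not in labels:
        labels[0] = "claim"
    claim_ids = [i for i, label in enumerate(labels) if label == "claim"]

    edges = []
    for i, label in enumerate(labels):
        if label != "premise":
            continue
        scores = {c: entail_score(sentences[i], sentences[c]) for c in claim_ids}
        best = max(claim_ids, key=lambda c: scores[c])
        if scores[best] > threshold:
            edges.append((i, best))
            continue
        # Fallback: closest claim that precedes the premise in the text.
        preceding = [c for c in claim_ids if c < i]
        if preceding:
            edges.append((i, preceding[-1]))
        # Assumption: a premise with no preceding claim stays unattached.
    return edges
```

Passing the scorer in as a callable keeps the sketch independent of any particular model library and makes the thresholding logic easy to test with a lookup table.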
Graph Construction For the final graph, for each of the documents in an example, we run the above procedure and obtain a set of claims and associated premises. We then identify support edges between claims, which may be across documents. One claim may make a larger assertion, which is supported by other claims. We run our entailment model over all potential edges (in both directions) among claims in the document and greedily add edges according to the entailment support score as long as no cycles are created. After this step, we are left with a set of claims which do not entail any other nodes or, stated otherwise, do not have parent nodes. Following the terminology of Barker and Gaizauskas (2016b), these nodes can be considered viewpoints.
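A minimal sketch of the greedy, cycle-free edge selection among claims is given below. Several details are assumptions rather than statements from the text: the function name, the `score(child, parent)` callable, the `threshold` parameter (no cutoff for claim-to-claim edges is stated), and the restriction to one parent per claim, which keeps the result a forest.

```python
def greedy_support_edges(num_claims, score, threshold=0.0):
    """Greedily link claims by descending entailment score without cycles.

    score(child, parent) -> float is a hypothetical callable over claim indices.
    Returns (parent, roots): parent maps child -> parent index, and roots
    lists the claims left without a parent (the viewpoints).
    """
    candidates = [
        (score(child, par), child, par)
        for child in range(num_claims)
        for par in range(num_claims)
        if child != par
    ]
    candidates.sort(key=lambda item: -item[0])

    parent = {}
    for value, child, par in candidates:
        if value <= threshold:
            break  # remaining candidates score even lower
        if child in parent:
            continue  # assumption: at most one parent per claim
        # Walk up from the proposed parent; reaching child would close a cycle.
        node, creates_cycle = par, False
        while node is not None:
            if node == child:
                creates_cycle = True
                break
            node = parent.get(node)
        if not creates_cycle:
            parent[child] = par

    roots = [claim for claim in range(num_claims) if claim not in parent]
    return parent, roots
```

Because each claim keeps at most one parent, the cycle check reduces to following parent pointers upward, which avoids a general graph search.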
We then identify issues, or topics on which the viewpoints differ. We run our entailment model again, in both directions, over all pairs of these parent claim nodes and identify nodes that contradict each other with probability over 0.33, a threshold chosen based on manual analysis of the resulting graphs. We greedily add edges to maintain a tree structure, joining these nodes to a special node, which we call the Issue node. All Issue nodes, as well as claims which are not connected to any Issue node, are connected to a dummy 'Conversation Node' which serves as the root of the argument graph. We show an example Issue subgraph for NYT data in Figure 1.
Argument Graphs to Summaries Recent work has shown the strength of text-based pretrained models on graph-to-text problems (Ribeiro et al., 2020). Following that work, we linearize the graph by following a depth-first approach starting from the Conversation Node. We found that inserting special tokens to signify edge types did not improve performance, likely due to the size of our data, so we simply use an arrow → to signify the relationship between sentences. We train a sequence-to-sequence model on our linearized graph input, which we call -arg-graph.
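One plausible reading of the depth-first linearization is sketched below. The exact output format is not specified in the text, so the choices here are assumptions: node texts are emitted in pre-order, joined by the arrow, and nodes without text (such as the dummy root) are skipped. The names `linearize`, `children`, and `text` are hypothetical.

```python
def linearize(children, text, root="CONVERSATION"):
    """Pre-order depth-first linearization of an argument tree.

    children: dict mapping a node id to its ordered list of child ids.
    text: dict mapping a node id to its sentence; nodes missing from this
          dict (e.g. the dummy root) contribute no text.
    """
    pieces = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node in text:
            pieces.append(text[node])
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return " → ".join(pieces)
```

The resulting string can be fed to a sequence-to-sequence model in place of the raw concatenated input.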

Experimental Settings
We use the fairseq codebase (Ott et al., 2019) for our experiments. Our base abstractive text summarization model is BART-large (Lewis et al., 2020), a pretrained denoising autoencoder with 336M parameters that builds on the sequence-to-sequence transformer of Vaswani et al. (2017). We fine-tune BART using a polynomial decay learning rate scheduler with the Adam optimizer (Kingma and Ba, 2015). We used a learning rate of 3e-5, with 20 warmup updates and 200 total updates, following previous few-shot transfer work (Fabbri et al., 2020a). We could have equally fine-tuned other pretrained models such as Pegasus (Zhang et al., 2019) or T5 (Raffel et al., 2019), but Fabbri et al. (2020a) find that BART largely performs equally well in few-shot settings when compared to Pegasus. For the NYT and Stack datasets, which contain sequences over the typical 1024 max encoder length with which BART is trained, we copied the encoder positional embeddings to allow sequences up to length 2048. To address the input length of the meeting summarization datasets, which ranges from 6k to 12k tokens, we use the Longformer (Beltagy et al., 2020), which allows for sequences up to length 16k tokens. We initialize the Longformer model with BART parameters trained on the CNN-DailyMail dataset, as the meeting summarization datasets contain fewer than 100 data points. We otherwise fine-tune models from vanilla BART, following intuition in few-shot summarization (Fabbri et al., 2020a) and based on initial experiments. In the tables which follow, "-arg" refers to any model trained with argument-mining-based input, and we specify which -arg-graph or -arg-filtered settings were used for each dataset below.
Table 6: ROUGE-1/2/L results for DDA-GCN (Feng et al., 2020) and HMNet (Zhu et al., 2020) on the AMI and ICSI meeting summarization datasets along with our Longformer and Longformer-arg models.
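To make the fine-tuning schedule concrete, the sketch below computes a learning rate with linear warmup followed by polynomial decay, using the stated peak rate of 3e-5, 20 warmup updates, and 200 total updates. It is a standalone illustration of the schedule's shape, not fairseq's implementation; the function name `lr_at`, the final rate of 0.0, and the decay power of 1.0 are assumptions, since those values are not given in the text.

```python
def lr_at(step, peak_lr=3e-5, warmup=20, total=200, end_lr=0.0, power=1.0):
    """Learning rate at a given update under warmup plus polynomial decay.

    end_lr and power are assumed values; the text states only the peak
    learning rate, the warmup updates, and the total updates.
    """
    if warmup > 0 and step <= warmup:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup
    if step >= total:
        return end_lr
    # Fraction of the decay phase still remaining.
    remaining = 1.0 - (step - warmup) / (total - warmup)
    return (peak_lr - end_lr) * remaining ** power + end_lr
```

Under these assumptions the rate reaches its peak at update 20 and falls linearly to zero by update 200, so the model sees only a short, gentle fine-tuning phase, which matches the few-shot setting.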

Results
We provide results for baseline, unsupervised extractive models in Table 4: LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and BERT-ext (Miller, 2019), which makes use of BERT (Devlin et al., 2019). The unsupervised extractive models perform well below the extractive oracle performance, suggesting the difficulty of content selection in this setting.
For abstractive models, we train BART on 200 examples from our validation set, use the remaining 50 for validation, and test on the final test set of 250 examples. We also tested zero-shot transfer from CNNDM and SAMSum, although this resulted in a much lower performance of about 28 ROUGE-1. Few-shot model performance is shown in Table 5. The abstractive model performs at or above the Extractive Oracle, suggesting the need for better abstractive models.
We also train on our argument mining-based approaches and show results in Table 5. We see ROUGE improvements when applying BART-arg-graph for Reddit and Stack data. The -arg-filtered variation (which, as defined in Section 4, is the less noisy version of the input produced by the argument extraction step) outperformed the -arg-graph variation on both email and NYT data. For email data, however, this did not improve upon the BART baseline, likely due to the dataset's characteristics; email data is shorter and more linear, not benefiting from modeling the argument structure or removing non-argumentative units. We provide full results for both variations in the Appendix.

Benchmarking Other Conversation Summarization Datasets
We benchmark our models on widely used meeting summarization datasets. Due to the input's linear nature and the size of the meeting transcripts, we found improved results using -arg-filtered to filter non-argumentative units rather than incorporating the graph structure. Results are shown in Table 6. The Longformer model performs as well as or better than previous state-of-the-art results on these datasets, despite not making use of more complex modeling structures, and we generally see improvement with argument mining. As noted above, there exist prior datasets for dialogue, community question answering, email, forum, and news comments summarization. We benchmark results on these datasets in Table 7. We outperform prior work on SAMSum (Gliwa et al., 2019b) and CQASUMM (Chowdhury and Chakraborty, 2018) with our BART and BART-arg-graph models, respectively. We did not find improvement on SAMSum with the BART-arg model due to the extremely short and focused nature of the dialogues, analogous to email data performance. We also provide transfer results of BART and BART-arg-graph models from our email and news-comment data to BC3 (Ulrich et al., 2009), ADS (Misra et al., 2015), and SENSEI data (Barker et al., 2016b), for which no prior neural abstractive summarization results existed.

Human Evaluations
We collect human judgment annotations for two of the four quality dimensions studied in Kryscinski et al. (2019) and Fabbri et al. (2020b), namely consistency and relevance. Consistency is defined as the factual alignment between the summary and the summarized source text, while relevance is defined as the summary's ability to select important content; only relevant information and viewpoints should be included. We did not include fluency, as an initial inspection of the data found fluency to be of very high quality, as has been shown to be the case for pretrained models in news summarization (Fabbri et al., 2020b). We did not include coherence, as this was generally not an issue of concern in the initial analysis. We randomly select 25 examples from the Reddit corpus and ten examples from the AMI corpus, together with output from the BART and BART-arg-graph models. These data points were chosen to show which characteristics are reflected in the ROUGE differences for the argument-graph and argument-noise-reduction approaches. Ten examples were chosen from AMI due to the size of the input and annotation constraints. The annotator sees the source article and randomly ordered outputs from the models and then rates the summaries for relevance and consistency on a Likert scale from 1 to 5, with 5 being the best score. We averaged the scores of three native English-speaking annotators on each example and then across examples. Results are shown in Table 8. We find that the annotators prefer our argument mining-based approaches in both dimensions. However, the results are close. Furthermore, the scores for relevance and consistency are rather low, especially on the Reddit dataset and when compared to results on the CNN-DailyMail dataset from Fabbri et al. (2020b). These results demonstrate the difficulty of modeling such conversational data. Examples are included in the appendix.
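The two-stage averaging described above (first across annotators for each example, then across examples) can be sketched as follows. The function name `aggregate_likert` and the scores are made up for illustration and are not taken from Table 8.

```python
from statistics import mean
from typing import List, Sequence


def aggregate_likert(ratings: Sequence[Sequence[int]]) -> float:
    """Average annotator scores per example, then average across examples.

    `ratings[i]` holds the 1-5 Likert scores that the annotators gave
    to example i for one quality dimension.
    """
    if not ratings:
        raise ValueError("at least one example is required")
    for scores in ratings:
        if not scores or any(not 1 <= score <= 5 for score in scores):
            raise ValueError("each example needs scores between 1 and 5")
    per_example = [mean(scores) for scores in ratings]
    return float(mean(per_example))


# Made-up relevance scores: three examples, three annotators each.
relevance_ratings = [[4, 3, 5], [2, 3, 3], [5, 4, 4]]
relevance_score = aggregate_likert(relevance_ratings)
```

When every example has the same number of annotators, as in the setup described here, this gives the same value as averaging all individual scores at once. The two-stage form is kept because it mirrors the procedure in the text.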

Conclusion
We propose ConvoSumm, a benchmark of four new, crowdsourced conversation datasets and state-of-the-art baselines on widely-used datasets that promote more unified progress in summarization beyond the news domain. Our benchmark consists of high-quality, human-written summaries that call for abstractive summaries and a deeper understanding of the input texts' structure. We provide results for baseline models and propose to model the text's argument structure, showing that such structure helps better quantify viewpoints in non-linear input in both automatic and human evaluations. Our analysis notes challenges in modeling relevance and consistency in abstractive conversation summarization when compared to news summarization.

Ethical Considerations
As we propose novel conversation summarization datasets and modeling components, this section is divided into the following two parts.

New Dataset
Intellectual Properties and Privacy Rights All data for our newly-introduced datasets are available online; please see the following for New York Times comment data 2, StackExchange data 3, and W3C email data 4. Reddit data is available via the Google BigQuery tool 5.
Compensation for Annotators We compensated the Turkers approximately $12-$15 per hour. We first annotated examples in-house to determine the required annotation speed. Typically, the summarization task took around 10 minutes, and we compensated the workers from $2.25 to $3.00 per task, depending on the domain and deadline requirements.
Steps Taken to Avoid Potential Problems We interacted closely with the Turkers to ensure that compensation was fair and that the instructions were clear. To maintain the quality of the dataset, we manually reviewed the crowdsourced summaries for language use. Initial investigation into Reddit data showed certain inappropriate language usage, so we filtered these examples automatically.

NLP Application
Bias Biases may exist in the datasets, such as political bias in the news datasets and gender bias in potentially all of the datasets. Thus, models trained on these datasets may propagate these biases. We removed data with offensive language when possible.
Misuse Potential and Failure Mode When used as intended, applying the summarization models described in this paper can save people much time. However, the current models are still prone to producing hallucinated summaries, and in such a case, they may contribute to misinformation on the internet. Because this issue is present in all current abstractive summarization models, further research is needed to ensure the faithfulness of abstractive summaries.

Environmental Cost
The experiments described in the paper make use of V100 GPUs. We used up to 8 GPUs per experiment (depending on the experiment; sometimes, a single GPU was used to run the maximum number of experiments in parallel). The experiments may take up to a couple of hours for the larger datasets. Several dozen experiments were run due to parameter search, and future work should experiment with distilled models for more lightweight training. We note that while our work required extensive experiments to draw sound conclusions, future work will be able to draw on these insights and need not run as many large-scale comparisons. Models in production may be trained once using the most promising settings.

B Sample Output
We provide examples of model outputs to offer more insight into the datasets and models. An example of Reddit input and outputs for which the models remain faithful to the source is found in Table 10. The gold summary balances being a meta-analysis of the input documents with providing sufficient details. We provide an additional example of outputs that struggle with consistency and relevance in Table 11. In the BART output, the model mistakes the suggestion in the input to pay debt before starting a business. In BART-arg, the model incorrectly determines relevance, as the suggestion that one should invest in pumpkins was sarcastic and not emphasized in the input. This output points to a need to better model interactions and salience in the conversation data.