EmailSum: Abstractive Email Thread Summarization

Recent years have brought increasing interest in the challenging task of summarizing conversation threads (meetings, online discussions, etc.). Such summaries help readers of long conversations quickly catch up with the decisions made and thus improve work and communication efficiency. To spur research in thread summarization, we have developed an abstractive Email Thread Summarization (EmailSum) dataset, which contains human-annotated short (<30 words) and long (<100 words) summaries of 2,549 email threads (each containing 3 to 10 emails) over a wide variety of topics. We perform a comprehensive empirical study of different summarization techniques (including extractive and abstractive methods, single-document and hierarchical models, as well as transfer and semi-supervised learning) and conduct human evaluations on both short and long summary generation tasks. Our results reveal the key challenges of current abstractive summarization models on this task, such as understanding the sender's intent and identifying the roles of sender and receiver. Furthermore, we find that widely used automatic evaluation metrics (ROUGE, BERTScore) are weakly correlated with human judgments on this email thread summarization task. Hence, we emphasize the importance of human evaluation and of developing better metrics by the community.


Introduction
As one of the major natural language generation tasks, automatic summarization has been studied for decades. Most research efforts have focused on single-document summarization, e.g., news document summarization (Hermann et al., 2015; Narayan et al., 2018). However, living in an information era, we face diverse content in different structures, and the summarization need varies across application scenarios. Recently, there has been increasing research interest in diverse summarization tasks (Gao et al., 2020), e.g., timeline (Allan et al., 2001), query-based (Li and Li, 2014), multi-modal (Zhu et al., 2018), meeting (Carletta et al., 2006), and dialogue or discussion thread (Misra et al., 2015; Gliwa et al., 2019; Rameshkumar and Bailey, 2020) summarization. Following the branch of dialogue or thread summarization, we introduce a new abstractive Email Thread Summarization (EMAILSUM) dataset.

Table 1: An example email thread with its short and long summaries.

Email Thread:
Subject: lunch this week
Susan: All, Regarding our lunch this week to celebrate the one year anniversaries for Michelle & David, and Mark's birthday, I have a request to make it Wednesday instead of Tuesday. Does anyone have an objection to this? Susan
David: I have another lunch engagement Wed, but I will skip it if everyone else wants to move our lunch. David
Tamra: Susan, Wednesday works out better for me as well. I have a doctor's appointment tomorrow during lunch. Tamra

Short Summary:
Susan emails everyone about an anniversary and offers to change the date. David says he is busy but is willing to go with the majority. Tamra agrees with Susan's date.

Long Summary:
Susan emails everyone about a lunch to celebrate a one year anniversary as well as Mark's birthday. She says she would change the date to a different day. David says he is busy that day with his own appointment but is willing to go with the majority and cancel that appointment to make this one. Tamra agrees with Susan's date as she is busy Tuesday with an appointment.
Email threads are widely used at work. An email thread is a special type of dialogue that usually has a specific structure (sender, receiver, greeting line, main body, and signature), contains technical information, and involves multiple speakers. Unlike a conversational dialogue turn, an email in a thread is much longer, has longer sentences and multiple action items or requests, and is stylistically closer to written text. Studies have shown that on average a worker sends/receives 122 business emails (Radicati, 2015) and spends more than 3 hours on those emails (Adobe, 2019) per day. One possible reason is that sometimes people have to read through the entire conversation before replying to the latest email. This happens when you forget the main points of previous discussions or you are newly included in a discussion thread. Therefore, automatically summarizing email threads can improve our work efficiency and provide practical benefits.

Email thread summarization is not a new task. Carenini et al. (2007) collected extractive summaries of 39 email threads from the Enron email corpus (Klimt and Yang, 2004) and proposed to use a fragment quotation graph and clue words to conduct summarization. Ulrich et al. (2008) collected both extractive and abstractive summaries of 40 threads from the W3C email corpus (Craswell et al., 2006), plus speech acts, meta sentences, etc. However, this task has been much less studied than other summarization tasks, partially due to the lack of large labeled email thread datasets.
In this paper, we collect human-written short (< 30 words) and long (< 100 words) abstractive summaries of 2,549 email threads constructed from the Avocado Research Email Collection (Oard et al., 2015), which is 64× the size of previously labeled email thread datasets (Carenini et al., 2007; Craswell et al., 2006). We limit each thread to a minimum of 3 and a maximum of 10 emails; an example is given in Table 1. We also extract 8,594 unlabelled email threads from both Avocado and W3C to facilitate semi-supervised learning. See Section 2 for details of data collection.
Next, we present comprehensive baselines from different learning paradigms as a benchmark for our new email summarization dataset. Specifically, we explore different summarization techniques, including extractive and abstractive methods, single-document and hierarchical models, transfer learning, and semi-supervised learning, for both short and long summary generation. Experiments demonstrate that utilizing a pretrained language model (e.g., T5 (Raffel et al., 2020)) is critical due to the small size of our data; treating the email thread as a single document sets up a good baseline; transferring from news or dialogue datasets barely improves performance; using hierarchical encoders only marginally improves it; while semi-supervised learning on unlabelled email threads significantly (p < 0.01) improves ROUGE (Lin, 2004) scores in some cases.
Lastly, to better understand how well email thread summarization models perform and to investigate the correlation between automatic metrics and human judgment, we ask humans to rate the "salience" (how well the model summarizes salient points) and "faithfulness" (how well the model stays true to the email thread) of model-generated summaries, and to perform a pairwise comparison between our best and base models. We find that even though semi-supervised learning improves ROUGE scores, human judges still favor the summaries generated by the baseline model (T5 base ). Two frequent errors made by the model are (1) failing to understand the sender's intent and (2) failing to identify the roles of the sender and receiver. Relatedly, human correlation analysis reveals that automatic metrics (ROUGE (Lin, 2004), BERTScore (Zhang et al., 2019)) are poorly correlated with human judgment, which stresses the importance of human evaluation in this task and the need for better metrics.

Overall, in this work, we propose the new EMAILSUM dataset, which provides a larger resource for studying the email thread summarization task. We conduct a comprehensive empirical model study and human evaluation analysis, which will serve as an important starting point for future studies.

EMAILSUM Dataset
To collect email thread summarization data, we first need unlabelled email threads. We resort to existing email collections: Enron (Klimt and Yang, 2004), W3C (Craswell et al., 2006), and Avocado (Oard et al., 2015). However, none of them provides explicit thread structure. Therefore, in this section, we introduce our email thread preprocessing and summary collection procedures.

Email Thread Preprocessing
We extract email threads from the flat email collections in the following steps: (1) we give every email a "normalized subject" by removing reply and forward tags (e.g., "Re:", "Fwd:") from its original subject; (2) we group emails by their normalized subjects and sort the emails in each group (i.e., candidate thread) in temporal order; (4) we traverse the emails in every thread in temporal order and cut off the thread when none of the senders or receivers of the current email appears in previous emails; (5) we filter out threads that only contain a single repeated content.
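Step (1) above amounts to stripping reply/forward prefixes from the subject line; a minimal sketch (the exact tag list and casing rules here are our assumptions, not the released preprocessing code):

```python
import re

# Leading reply/forward tags such as "Re:", "RE:", "Fwd:", "FW:", possibly repeated.
TAG = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)

def normalized_subject(subject: str) -> str:
    """Strip leading reply/forward tags so replies group with the original email."""
    return TAG.sub("", subject).strip()
```

Grouping emails by `normalized_subject` then recovers candidate threads from a flat collection.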
To obtain a cleaner dataset, we remove threads that do not comply with the following constraints: (1) 3 ≤ the number of emails ≤ 10; (2) 5 < the number of words in each email < 200; (3) 30 < the total number of words < 1000; (4) the thread does not contain non-English (e.g., German) tokens; (5) the subject of the first email does not contain reply or forward tags.
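The length constraints above can be written as a simple predicate over a thread; a minimal illustration (the thread representation is our own, and constraints (4) and (5) are omitted because they need a language-ID tool and the subject metadata):

```python
def keep_thread(emails):
    """Check the length-based cleaning constraints on a candidate thread.

    `emails` is a list of email bodies (strings). Returns True if the thread
    passes constraints (1)-(3) described above.
    """
    lengths = [len(e.split()) for e in emails]
    if not (3 <= len(emails) <= 10):               # (1) 3-10 emails per thread
        return False
    if any(l <= 5 or l >= 200 for l in lengths):   # (2) 5 < words per email < 200
        return False
    if not (30 < sum(lengths) < 1000):             # (3) 30 < total words < 1000
        return False
    return True
```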
Emails often contain personal information such as full name, email/physical address, phone number, etc. To protect privacy, we anonymize all email threads before annotation: (1) only keep first names; (2) remove threads that have "password", "pwd", "confidential", etc.; (3) replace email address, physical address, phone number, URL, IP address, local path, and other sensitive numbers with USERNAME@DOMAIN.COM, ADDRESS, PHONENUMBER, HTTP://LINK, IPADDRESS, PATH, and NUMBER, respectively.
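The replacement step can be sketched with a few regular expressions; these patterns are illustrative only (the paper's exact rules are not published here, and production PII scrubbing needs far more care):

```python
import re

# Illustrative anonymization patterns, applied in order.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "USERNAME@DOMAIN.COM"),  # email address
    (re.compile(r"https?://\S+"), "HTTP://LINK"),                          # URL
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "IPADDRESS"),             # IPv4 address
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "PHONENUMBER"),           # US-style phone
]

def anonymize(text: str) -> str:
    """Replace sensitive spans with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```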
We conduct an extensive manual quality scan to make sure that the extracted threads are truly threads (instead of randomly grouped emails) and are properly anonymized. Finally, we obtain 8,116 threads from Avocado and 3,478 threads from W3C. We randomly sample 3K Avocado threads for summary annotation; the remaining threads are used as unlabelled data.

Thread Summary Collection
We collect summary annotations on Amazon Mechanical Turk. Since summarizing text is not an easy task, we use several quality control strategies to obtain acceptable English summaries: (1) we select annotators who are located in the US, have an approval rate greater than 97%, and have at least 10,000 approved HITs; (2) during annotation, we periodically sample summaries, manually check their quality, and reject or block poor-quality annotators; (3) after annotation, we randomly sample 2 examples per annotator, manually categorize annotators into "good", "fair", and "bad" groups, and filter out examples written by "bad" annotators.
Email threads oftentimes contain technical information. We instruct annotators not to get stuck on technical details but to focus on the major concerns, decisions, and consensus. We collect both a short (< 30 words) and a long (< 100 words) abstractive summary per thread. For the short summary, we instruct annotators to write a concise description of what the thread is mainly talking about; for the long summary, we instruct them to write a narrative of what happens. We intend to provide summaries at two different levels of abstractiveness, length, and concreteness. We show annotators an example written by an expert (a CS graduate student). More summary collection details can be found in Appendix A.

Final Dataset Description
The summary collection and filtering process yields 2,549 email threads, each with a long and a short summary. We randomly sample 500 examples from the "good" annotator group as our testing set and split the remaining examples into training (1,800 threads) and development (249 threads) sets. Table 2 shows the statistics of EMAILSUM. For ease of benchmarking, we also include statistics of other commonly used summarization datasets: CNN/DM (Hermann et al., 2015) and XSum (Narayan et al., 2018) are news summarization datasets; SAMSum (Gliwa et al., 2019) is about chit-chat summarization; CRD3 (Rameshkumar and Bailey, 2020) is a role-play dialogue summarization dataset; BC3 (Ulrich et al., 2008) is another email thread summarization dataset with 40 threads from W3C. Compared to the other datasets, the average document length in EMAILSUM is not very long (233 words), and long summaries are more than twice as long as short summaries. "Ext-Oracle-R1" in Table 2 indicates how abstractive the summaries are: it is the ROUGE-1 score of an oracle extractive method (see Section 3.1 for details of the oracle extractive method); the lower it is, the more abstractive the dataset. According to this score, the EMAILSUM summaries are less abstractive than the XSum summaries but more abstractive than the CNN/DM summaries. Furthermore, the short summaries of EMAILSUM are more abstractive than its long summaries.

Models
The summarization models we explore in this work take the email thread as input and generate the summary as output. We experiment on the EMAILSUM short and EMAILSUM long tasks separately.

Extractive
Oracle. This method selects source sentences to maximize an evaluation metric w.r.t. the gold summary. "Ext-Oracle-R1" in Table 2 is computed from an oracle summary that maximizes ROUGE-1 (Lin, 2004).
Lead. This model simply picks the first sentence from the source document as the summary, which has surprisingly good performance on the CNN/DM dataset (Narayan et al., 2018). We test two variants by selecting: (1) the first sentence of the email thread, which is usually the subject (see the example in Table 1), referred to as Lead-1; (2) the first sentence of the email thread (the subject) plus the first sentence of every email, named Lead-1-Email. (We also tested some other heuristics, e.g., the first sentence of the last email and the last 3-5 sentences of the email thread, but none of them performs better than Lead-1-Email.)

TextRank. This is a graph-based method (Mihalcea and Tarau, 2004). It first builds a graph over sentences based on their embedding similarities; then the PageRank algorithm is applied to obtain a rank score for each sentence, and top-ranked sentences are selected as the summary.
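The Lead heuristics are easy to state precisely; a minimal sketch (the thread representation and the naive sentence splitter are our own simplifications):

```python
import re

def first_sentence(text: str) -> str:
    # Naive sentence splitter: cut at the first ., !, or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip(), 1)[0]

def lead_1(subject: str, emails: list) -> str:
    """Lead-1: the first 'sentence' of the thread, i.e., its subject line."""
    return subject

def lead_1_email(subject: str, emails: list) -> str:
    """Lead-1-Email: the subject plus the first sentence of every email."""
    return " ".join([subject] + [first_sentence(e) for e in emails])
```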
BertSumExt. Liu and Lapata (2019b) propose to build a sentence extractor upon BERT (Devlin et al., 2019) to perform extractive summarization, which achieves a good performance on CNN/DM.

Abstractive
Fast Abs RL. As a simple non-pretrained abstractive baseline, we use Chen and Bansal (2018), a hybrid model that first extracts sentences from the source document and then rewrites the extracted sentences with an abstractive rewriter; summary sentences are paired with the extracted sentences to train the rewriter. To adapt their model to email thread summarization, we make two adjustments: (1) we extract emails instead of sentences, a natural unit for email threads; (2) since summary sentences usually follow the temporal order of the emails, we enhance the pairing procedure with the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970; Rameshkumar and Bailey, 2020) to impose an order constraint on the alignment (see description and comparison in Appendix B).
T5. T5 (Raffel et al., 2020) is a Transformer-based (Vaswani et al., 2017) seq-to-seq model pretrained on large-scale English data. It achieves state-of-the-art performance on many NLP tasks, including the CNN/DM summarization task. As our main baseline, we treat the email thread as a single document and finetune T5 base to generate the summary. A similar setup is also used in transfer and semi-supervised learning. Since our training dataset is small, we find that leveraging pretrained knowledge is crucial: training a T5 model from scratch performs poorly (see the results in Appendix Table 7).
Transfer Learning. To analyze how information from other summarization datasets (listed in Table 2) can be transferred to this new task and its impact on performance, we investigate two simple transfer learning methods: (1) Pre-finetuning, in which we first finetune T5 on a bigger summarization dataset (e.g., CNN/DM) and then continue finetuning on our dataset, referred to as X pre in our result tables (X is the bigger dataset's name, e.g., CNNDM pre ). This is analogous to the continual training method proposed for multilingual transfer learning in machine translation (Kocmi and Bojar, 2018).
(2) Joint-training, in which we upsample EMAILSUM data and mix it with another dataset, then use the combined data to finetune T5, similarly denoted as X joint . This is analogous to the multilingual joint training method used in machine translation (Johnson et al., 2017).
Semi-supervised learning. Since we only have 2.5K labeled email threads, another important technique for improving performance is to utilize unlabelled data (i.e., email threads without labeled summaries). As introduced in Section 2.1, in addition to the 3K email threads used for summary collection, we have 8,594 unlabelled email threads (5,116 from Avocado; 3,478 from W3C). We explore semi-supervised learning via the simple self-training technique (Scudder, 1965): we use a trained model (a finetuned T5) to generate summaries for unlabelled threads, then mix the model-labeled and human-labeled data to finetune T5 again, referred to as SemiSup x (x stands for the unlabelled data source, i.e., W3C, Avocado, or together).
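The self-training recipe is model-agnostic and can be written as a short loop; the function arguments below are placeholders standing in for T5 finetuning and generation, not the actual training code:

```python
def self_train(fit, predict, labeled, unlabeled, rounds=1):
    """Generic self-training (Scudder, 1965).

    Train on labeled data, pseudo-label the unlabeled pool with the current
    model, then retrain on the union. `fit` maps a list of (x, y) pairs to a
    model; `predict` maps (model, x) to a pseudo-label y. In the paper's
    setup, `fit` would be T5 finetuning and x an email thread.
    """
    model = fit(labeled)
    for _ in range(rounds):
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        model = fit(labeled + pseudo)
    return model
```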
Hierarchical T5. Hierarchical summarization models have been shown to improve performance on multi-document summarization (Liu and Lapata, 2019a). Although an email thread can be treated as a single document due to the temporal dependency between consecutive emails, it also has a clear turn structure that encourages the use of hierarchical encoders. Recently, Zhu et al. (2020) proposed a hierarchical model (HMNet) for meeting summarization. Inspired by their work, we propose a hierarchical model that is similar to HMNet in structure but uses T5 as the backbone, so it can take advantage of both the hierarchical structure and the pretrained knowledge. As shown in Figure 1, this model contains two encoders: the token-level encoder encodes the whole email thread (e.g., e1, e2, e3, e4), while the email-level encoder receives mean-pooled email-level representations as input. The decoder has two cross attentions that attend to the outputs of the email-level and token-level encoders, respectively. The token-level and email-level encoders share the weights of the T5 encoder; we add only a small number of new parameters via the new cross attention between the decoder and the email-level encoder.

For automatic evaluation, we use: (1) ROUGE-1 (R1), which measures uni-gram overlap; (2) ROUGE-2 (R2), which measures bi-gram overlap; (3) ROUGE-L (RL), which computes the longest common subsequence (LCS); and (4) summary-level ROUGE-L (RLsum), which computes the LCS between each pair of reference and candidate sentences and returns the union-LCS. We use the rouge_score package and report F1 scores.
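As a concrete reference, ROUGE-1 F1 reduces to clipped uni-gram overlap counting; a from-scratch sketch with whitespace tokenization only (the released rouge_score package additionally handles tokenization details and optional stemming):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped uni-gram overlap between candidate and reference.

    A minimal illustration; real ROUGE implementations normalize text more
    carefully before counting.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped overlap counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```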
BERTScore (Zhang et al., 2019) goes beyond n-gram overlap to provide contextualized semantic similarity. Specifically, it uses BERT (Devlin et al., 2019) (or RoBERTa (Liu et al., 2019)) representations to "softly" align the words in candidate and reference summaries and then computes a "soft" uni-gram F1 score. We use the bert_score package and report numbers rescaled with a baseline.

Table 3 shows the evaluation results of different models on the testing set (the corresponding results on the development set can be found in Appendix Table 7). The Oracle extractive model sets up a high upper bound on all metrics except BERTScore (BertS). Among non-oracle extractive methods, the Lead-1-Email heuristic works best, even better than the deep extractive method, BertSumExt. The hybrid Fast Abs RL model outperforms purely extractive methods but works worse than purely abstractive methods with large-scale pretraining (e.g., T5). (Our significance tests follow the bootstrap test setup of Efron and Tibshirani (1994).)

Taking the email thread as one single document and finetuning T5 (i.e., T5 base in Table 3) sets up a strong baseline. On top of this baseline, we test transfer learning from four other summarization datasets (CNN/DM, XSum, SAMSum, and CRD3). However, as shown in Table 3, transfer learning barely improves over the baseline, and transferring by pre-finetuning always works better than joint-training. Since EMAILSUM has quite a different domain from existing news or dialogue datasets, we conjecture that it is hard to transfer knowledge between them, or that better transfer techniques are needed. Similarly, we test semi-supervised learning with unlabelled data from W3C, Avocado, and both of them (together). This method mostly (and in some cases significantly) outperforms the baseline for both EMAILSUM short and EMAILSUM long .
Lastly, the hierarchical T5 base model only marginally outperforms the non-hierarchical baseline for the EMAILSUM long task. It is notable that overall EMAILSUM long has higher ROUGE scores but lower BERTScore than EMAILSUM short .

Figure 2: The impact of the number of emails in the thread on summarization performance (ROUGE-1). The results are on the testing set. short/long denotes EMAILSUM short /EMAILSUM long ; base/best denotes the baseline/best model.

Results
Since we focus on generating abstractive summaries for email threads and the human-written summaries are fairly abstractive (as shown in Table 2), we further investigate the abstractiveness of model-generated summaries. We take summaries generated by the baseline (T5 base ) and the best ROUGE-1 models (SemiSup together for EMAILSUM short , SemiSup w3c for EMAILSUM long ) as the pseudo ground-truth, respectively. Then, we evaluate the ROUGE-1 of the extractive Oracle and Lead-1-Email models; a higher score means more extractive summaries. As shown in Table 4, compared to humans, models generate much more extractive summaries. Moreover, the semi-supervised models (R1-best) are even more extractive than the baseline, probably because the self-training procedure amplifies the extraction tendency. Lastly, for both base and best models, and for both short and long summaries, the model performance (ROUGE-1) decreases as the number of emails in the thread increases (shown in Figure 2).

Table 5: Pairwise comparison between summaries generated by the best ROUGE-1 models and T5 base (Win/Lose/Tie counts for the best model):

                 Short (Win/Lose/Tie)   Long (Win/Lose/Tie)
Salience         109 / 133 / 55         109 / 130 / 50
Faithfulness     116 / 123 / 58         126 / 122 / 41
Overall quality  120 / 138 / 39         125 / 140 / 24

Human Rating Collection
To better understand where the model still falls short and to investigate whether the automatic metrics correlate well with human judgments, we conduct a human evaluation on Amazon Mechanical Turk. By manually checking the quality of model-generated summaries, we find that models can mostly generate grammatical, relevant, and fluent summaries; however, they often fail to be salient and faithful, i.e., they tend to be over-detailed or do not stay true to the source thread. Therefore, we ask human annotators to rate the "salience" and "faithfulness" of model-generated summaries. We evaluate the best ROUGE-1 models (SemiSup together for EMAILSUM short , SemiSup w3c for EMAILSUM long ), sampling 100 examples and collecting 3 responses per example. Human judges rate salience and faithfulness on a 5-point Likert scale and annotate which summary sentences are not salient or unfaithful. We explain the meaning of "salience" and "faithfulness" to annotators and instruct them how to rate from 1 to 5. Meanwhile, to verify the improvement of the best R1 models over T5 base , we ask judges to compare the summaries generated by these models with those from T5 base and judge which one is more salient, more faithful, and of higher overall quality. More collection details can be found in Appendix D.
We check the average inter-rater agreement (Krippendorff's alpha (Krippendorff, 2011)) of the "salience" and "faithfulness" ratings. It is around 0.09 to 0.23, i.e., slight to fair agreement (Fleiss and Cohen, 1973). However, when we convert the ratings to a 3-point scale by taking {3}, {4, 5}, and {1, 2} as the three classes, the agreement increases to 0.36 to 0.63, i.e., fair to substantial. This indicates that subjectivity affects the ratings: people have a hard time distinguishing 'bad' from 'very bad' and 'good' from 'very good'. Meanwhile, the ratings for short summaries are always less agreed upon across raters (0.36-0.38) than those for long summaries (0.58-0.63), which suggests that there may be multiple valid ways of summarizing an email thread into a short summary. The agreement of the pairwise comparison is around 0.20 to 0.24 (fair agreement), because the baseline and the best models have nearly indistinguishable performance (shown in Table 5). Finally, we take the 3-rater average as the final human rating for each example.
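The 3-point conversion used above is just a bucketing of the 5-point Likert scale, merging the two low and the two high ratings (the class names below are ours):

```python
def to_3_point(rating: int) -> str:
    """Collapse a 5-point Likert rating into three classes:
    {1, 2} -> 'low', {3} -> 'mid', {4, 5} -> 'high'."""
    if rating <= 2:
        return "low"
    if rating == 3:
        return "mid"
    return "high"
```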
In addition, we evaluate the correlations (Pearson Correlation (Benesty et al., 2009)) among different human ratings. The correlation between salience and faithfulness ratings is 0.36/0.45 for short/long summarization. And the correlations among salience, faithfulness, and overall quality pairwise preferences are around 0.53 to 0.79. Overall, moderate to large (Cohen, 2013) correlations are observed.

Generated Summary's Quality Analysis
Surprisingly, human evaluators are mostly satisfied with the salience and faithfulness of model-generated summaries: ratings are around 4 out of 5. On average, humans rate the salience and faithfulness of SemiSup together generated short summaries at 3.89 and 4.04, respectively, and those of SemiSup w3c generated long summaries at 4.22 and 4.29. Examples with low and high ratings are shown in Table 6 and Appendix Table 8, respectively. Humans rate model-generated long summaries higher, which matches the trend of ROUGE scores, and they are more satisfied with faithfulness than with salience.

Table 5 shows the pairwise comparison between the best ROUGE-1 models and T5 base . Except for the faithfulness of EMAILSUM long , the best ROUGE-1 models mostly lose to the baseline (though the losses and wins are mostly marginal). Together with Table 4, we conjecture that the improvement obtained by semi-supervised learning exploits n-gram matching accuracy by making the summary more extractive, while humans prefer more abstractive summaries.

Lastly, we analyze the non-salient and unfaithful sentences labeled by the human evaluators. We find two errors frequently made by the summarization model: (1) Failing to understand the sender's intent. Usually, when we send an email, there is a high-level intention behind the detailed content we write, e.g., starting a discussion, bringing up a concern, or broadcasting a decision. However, models are oftentimes unable to capture this intention and thus focus overly on details. As shown in the first example of Table 6, Om intends to summarize the important points from a meeting, while the model only picks the first piece of detail in that email as the summary. This problem is also related to the over-extractive issue (shown in Table 4): the model tends to extract details from the source thread, and the extraction is biased toward the first sentence of each email.
(2) Failing to identify the roles of the sender and receiver. An email thread is a special type of conversation with multiple speakers involved. One important task for the model is to identify the roles of different speakers and their relations, i.e., who does what to whom. As shown in the second example of Table 6, the model wrongly attributes "2 fixes in 382 are in the patch installer" to Nilesh, whereas it is supposed to be by Diana. The same issue can also be observed in the first example: Om is just summarizing what Nihar said rather than telling Nihar. This is a type of unfaithfulness, which has been widely identified as a common issue of abstractive summarization models (Wang et al., 2020; Durmus et al., 2020; Maynez et al., 2020).

Figure 3: Correlation (Pearson coefficient) between automatic metrics and human judgments. Short and Long refer to the EMAILSUM short and EMAILSUM long tasks, respectively.

ROUGE (Lin, 2004) measures n-gram overlap, and BERTScore (Zhang et al., 2019) is essentially based on "soft" uni-gram matching. However, according to the analysis presented above, email thread summarization models mainly fail to be abstractive, salient, and faithful, properties that are hard to evaluate via n-gram overlap. Furthermore, as pointed out by Bhandari et al. (2020), different datasets usually require different evaluation metrics. Therefore, we study the correlation between automatic metrics and human judgments.
Specifically, we evaluate the Pearson correlation between human ratings and automatic metric scores on the 100 examples used in the human evaluation. In addition, as described above, we conduct a pairwise model comparison between the best ROUGE-1 models and T5 base for "salience", "faithfulness", and "overall quality". We convert each comparison to a pairwise ranking score: -1 if T5 base is better, 1 if T5 base is worse, and 0 if the two models are indistinguishable. In the same way, we convert the differences in metric scores to ranking scores, and evaluate the Pearson correlation between the human and metric ranking scores. Figure 3 illustrates the results. Overall, the correlations are fairly poor. The best correlation is between ROUGE-1 and the human overall-quality ranking for short summary generation (coefficient=0.14, p=0.16). There is little or even negative correlation between metrics and human judgment for long summary generation. Therefore, we emphasize the importance of human evaluation, and better automatic proxies need to be proposed in the future.
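The ranking-score conversion and the correlation computation are straightforward; a self-contained sketch in pure Python (in place of the scipy/sklearn routines one would normally use):

```python
def rank_score(base: float, other: float) -> int:
    """Convert a pairwise comparison into {-1, 0, 1}:
    -1 if the baseline is better, 1 if it is worse, 0 if they tie."""
    if base > other:
        return -1
    if other > base:
        return 1
    return 0

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```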

Conclusion
In this work, we propose an abstractive email thread summarization dataset, EMAILSUM, that contains 2,549 email threads with human-written short and long summaries. We explore different summarization paradigms and find that taking the email thread as a single document and finetuning T5 (Raffel et al., 2020) sets up a good baseline. Transferring from other summarization datasets barely improves it. Using hierarchical structure also only marginally improves the performance. Semi-supervised learning by using unlabelled email threads improves automatic metrics (ROUGE) but still loses to the baseline in human evaluation. Finally, our human evaluation reveals that the model fails to understand the sender's main intention and the roles of different speakers. Automatic metrics are poorly correlated with human judgment, which emphasizes the importance of human evaluation and designing new metrics for this task in the future.

Broader Impact Statement
We use two email collections in this work: Avocado (Oard et al., 2015) and W3C (Craswell et al., 2006). W3C is derived from the W3C Public Mailing List, which is openly available online. Avocado consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado"; its copyright is protected by the Linguistic Data Consortium. Based on the license agreement, we will only open-source our collected summaries and provide scripts to obtain email threads from the original Avocado email collection. To further protect copyright and the privacy of the persons involved in the emails, as introduced in Section 2, we carefully anonymize all the email threads we construct from both email collections. We pay crowdsource workers fairly: $1.37 (for threads with 5 or fewer emails) or $2 (for threads with more than 5 emails) for writing the short and long summaries, and $0.60 for human rating, such that the pay rate is higher than the federal minimum wage requirement.

B Fast Abs RL
The original Fast Abs RL method (Chen and Bansal, 2018) uses ROUGE-L recall to align extracted source sentences with target summary sentences. In our case, we extract emails and align them with summary sentences. Since the emails and summary sentences usually follow the same temporal order, we enhance the alignment procedure with the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970; Rameshkumar and Bailey, 2020) to impose strict order constraints, e.g., there should be no case where email i is aligned to sentence j while email i+1 is aligned to sentence j-1.
Meanwhile, we modify it to allow one email to be aligned with multiple summary sentences but prevent one summary sentence from being aligned with multiple emails. Specifically, we first compute the similarity matrix M of size n_e × n_s between each email and each summary sentence via ROUGE-L recall (n_e is the number of emails, n_s is the number of summary sentences); then the alignment score matrix H of size (n_e + 1) × (n_s + 1) is initialized to all zeros and computed as follows for 1 ≤ x ≤ n_e, 1 ≤ y ≤ n_s:

H[x, y] = max( H[x-1, y-1] + M[x-1, y-1],   (align sentence y to a new email x)
               H[x, y-1]   + M[x-1, y-1],   (align another sentence y to the same email x)
               H[x-1, y] )                  (skip email x)

Then we trace back from H[n_e, n_s] to H[0, 0] to obtain the final alignment. As shown in Table 7, "Fast Abs RL (default)" refers to this method with the default setting, which mostly works worse than our enhanced Fast Abs RL.
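The recurrence above translates directly into a small dynamic program; a sketch with a simple traceback (the traceback's tie-breaking order is our own choice):

```python
def align(M):
    """Order-constrained email/sentence alignment following the modified
    Needleman-Wunsch recurrence: one email may absorb several consecutive
    summary sentences, a sentence may not span several emails, and emails
    may be skipped. M[x][y] is the similarity of email x and sentence y.
    Returns the score matrix H and, per sentence, the index of its email."""
    ne, ns = len(M), len(M[0])
    H = [[0.0] * (ns + 1) for _ in range(ne + 1)]
    for x in range(1, ne + 1):
        for y in range(1, ns + 1):
            H[x][y] = max(
                H[x - 1][y - 1] + M[x - 1][y - 1],  # align sentence y to a new email x
                H[x][y - 1] + M[x - 1][y - 1],      # extend email x with one more sentence
                H[x - 1][y],                        # skip email x
            )
    assignment = [None] * ns
    x, y = ne, ns
    while x > 0 and y > 0:
        if H[x][y] == H[x][y - 1] + M[x - 1][y - 1]:
            assignment[y - 1] = x - 1               # same email, previous sentence
            y -= 1
        elif H[x][y] == H[x - 1][y - 1] + M[x - 1][y - 1]:
            assignment[y - 1] = x - 1               # new email, previous sentence
            x, y = x - 1, y - 1
        else:
            x -= 1                                  # email x was skipped
    return H, assignment
```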

C Experimental Details & Additional Results
We implement the TextRank (Mihalcea and Tarau, 2004) model via the summa python package and set the summarization ratio to the average summary-length-to-thread-length ratio in the training set, which is 0.22 for short summaries and 0.38 for long summaries.
We test Fast Abs RL (Chen and Bansal, 2018) via the authors' open-source code. Most of our models are built on T5 (Raffel et al., 2020); we use the base version, which has 220 million parameters. Our hierarchical T5 shares the same T5 encoder parameters between the token-level and email-level encoders; the only new parameters added are from the cross attention between the decoder and the email-level encoder. We use Transformers (Wolf et al., 2020) to run all the T5-based models. We run experiments on a single Tesla V100 GPU. We set the max input sequence length to 512 tokens and the max output length to 56 tokens during training (200 tokens during evaluation). The total batch size (with gradient accumulation) is 128. The learning rate is 5e-4, except for training T5 base from scratch, where we use 1e-4 instead. Since our training set only contains 1.8K examples, training takes only 2-4 minutes per epoch; we train models for 70 epochs.
Our model selection is based on each of the five evaluation metrics: ROUGE-1, ROUGE-2, ROUGE-L, summary-level ROUGE-L, and BERTScore. We select the best checkpoint for each of the five metrics on our development set, then evaluate those checkpoints on the testing set to report the final numbers for each metric. Table 7 shows all the results on our development set. Table 8 shows two examples with high-rating model-generated summaries.

D Human Evaluation

Figure 5 shows the questions we asked human judges to evaluate the quality of model-generated summaries. Before these questions, we instruct annotators how to rate "salience" and "faithfulness" on a 5-point Likert scale: (1) Rate salience from 1 to 5: 1 is the worst, none of the points in the summary is important enough to be summarized; 5 is the best, all of the points mentioned in the summary are important and worth summarizing. (2) Rate faithfulness from 1 to 5: 1 is the worst, all of the sentences in the summary are either wrong or do not exist in the email thread; 5 is the best, all of the points mentioned in the summary are true to the thread. In addition, we show examples of "non-salient" and "unfaithful" summaries on the webpage. We pay annotators $0.60 per HIT.