ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts

Despite tremendous progress in automatic summarization, state-of-the-art methods are predominantly trained to excel at summarizing short newswire articles or documents with strong layout biases such as scientific articles or government reports. Efficient techniques to summarize financial documents, which discuss facts and figures, have largely been unexplored, mainly due to the unavailability of suitable datasets. In this work, we present ECTSum, a new dataset with transcripts of earnings calls (ECTs), hosted by publicly traded companies, as documents, and expert-written short telegram-style bullet-point summaries derived from corresponding Reuters articles. ECTs are long unstructured documents without any prescribed length limit or format. We benchmark our dataset with state-of-the-art summarization methods across various metrics evaluating the content quality and factual consistency of the generated summaries. Finally, we present a simple yet effective approach, ECT-BPS, to generate a set of bullet points that precisely capture the important facts discussed in the calls.


Introduction
Earnings Calls, typically a teleconference or a webcast, are hosted by publicly traded companies to discuss important aspects of their quarterly (10-Q) or annual (10-K) earnings reports, along with current trends and future goals that help financial analysts and investors review their price targets and trade decisions (Givoly and Lakonishok, 1980; Richard Frankel and Skinner, 1999; Bowen et al., 2002; Keith and Stent, 2019). The corresponding call transcripts (called Earnings Call Transcripts, abbreviated as ECTs) are typically long unstructured documents consisting of thousands of words. Hence, it requires a great deal of time and effort, even on the part of trained analysts, to quickly summarize the key facts covered in these transcripts. Given the importance of these calls, they are often summarized by media houses such as Reuters and BusinessWire. The scale of such effort, however, calls for the development of efficient methods to automate this task, which in turn necessitates the creation of a benchmark dataset.
Towards this goal, we present ECTSum, a new benchmark dataset for bullet-point summarization of long ECTs. As discussed in Section 3.2, we first crawled around 7.4K ECTs from The Motley Fool, posted between January 2019 and April 2022, corresponding to the Russell 3000 Index companies. Reuters was chosen as the source of our target summaries, per consultation with domain experts, since the expert-written articles posted on Reuters effectively capture the key takeaways from earnings calls. However, pairing the collected ECTs with corresponding Reuters articles was non-trivial, given that not all calls are tracked. After carefully performing data cleaning and addressing pairing issues, we arrive at a total of 2,425 document-summary pairs as part of the dataset.
What makes ECTSum truly different from other datasets is the way the summaries are written. Instead of containing well-formed sentences, the articles contain telegram-style bullet points precisely capturing the important metrics discussed in the earnings calls. A sample reference summary from our dataset, corresponding to the 2nd quarter 2022 earnings call of Apple, is shown in Table 1. There are several other factors that make ECTSum a challenging dataset. First, the document-to-summary compression ratio of 103.67 is the highest among existing long document summarization datasets with comparable document lengths (Table 2). Hence, in order to do well, trained models need to be highly precise in capturing the most relevant facts discussed in the ECTs in as few words as possible.
We benchmark the performance of several representative summarization approaches (Section 5.1), from both supervised and unsupervised paradigms, on our newly proposed dataset. Among supervised methods, we select the state of the art from the extractive, abstractive, and long document summarization literature. Finally, given the pattern of source transcripts and target summaries, we present ECT-BPS, a simple yet effective pipeline approach for the task of ECT summarization (Section 4). Specifically, it consists of an extractive summarization module followed by a paraphrasing module. While the former is trained to identify salient sentences from the source ECT, the latter is trained to paraphrase ECT sentences into short abstractive telegram-style bullet points that precisely capture the numerical values and facts discussed in the calls.
In order to demonstrate the challenges of the proposed ECTSum dataset, competing methods are evaluated on several metrics that assess the content quality and factual consistency of the model-generated summaries. These metrics are discussed in Section 5.2. We discuss the comparative results of all considered methods against automatic evaluation metrics in Section 5.4. Given the complex nuances of financial reporting, we further conduct a human evaluation experiment (survey results reported in Section 5.5) where we hire a team of financial experts to manually assess and compare the summaries generated by ECT-BPS and those of our strongest baseline. Overall, both automatic and manual evaluation results show ECT-BPS to outperform strong state-of-the-art baselines, which demonstrates the advantage of a simple approach.
Our contributions can be summarized as follows: • We present ECTSum, the first long document summarization dataset in the finance domain that requires models to process long unstructured earnings call transcripts and summarize them in a few words while capturing crucial metrics and maintaining factual consistency. • We propose ECT-BPS, a simple approach to effectively summarize ECTs while ensuring factual correctness of the generated content. We establish its better efficacy against strong summarization baselines across all considered metrics evaluating the content quality and factual correctness of model-generated summaries.


Related Work
Extractive (Nallapati et al., 2017; Zhong et al., 2020), abstractive (Zhang et al., 2019; Lewis et al., 2020), as well as long document summarization (Zaheer et al., 2020; Beltagy et al., 2020) have seen tremendous progress over the years (Huang et al., 2020). Several works also exist on controllable summarization (Mukherjee et al., 2020; Amplayo et al., 2021) and in specific domains, such as disaster (Mukherjee et al., 2022) and legal (Shukla et al., 2022). However, the field of financial data summarization remains largely unexplored, primarily due to the unavailability of suitable datasets. Passali et al. (2021) have recently compiled a financial news summarization dataset consisting of around 2K Bloomberg articles with corresponding human-written summaries. However, similar to other popular newswire datasets such as CNN/DM (Nallapati et al., 2016), Newsroom (Grusky et al., 2018), and XSum (Narayan et al., 2018), the documents (news articles) themselves are only a few hundred words long, hence limiting the practical importance of model-generated summaries (Kryściński et al., 2021).
To the best of our knowledge, FNS (El-Haj et al., 2020) is the only other available financial summarization dataset, released as part of the Financial Narrative Summarization Shared Task 2020. In FNS, annual reports of UK firms constitute the documents, and a subset of narrative sections from the reports are given verbatim as reference summaries. However, ECTSum differs from FNS on several accounts.
First, our target summaries consist of a small set of telegram-style bullet points, whereas those in FNS are large extractive portions of the respective source documents. Second, ECTSum has a very high document-to-summary compression ratio (refer to Section 3.3), because of which models are expected to generate extremely concise summaries of around 50 words from lengthy unstructured ECTs around 2.9K words long. In contrast, the expected length of model-generated summaries on FNS is around 1000 words. Finally, the models developed on FNS are specifically trained to identify and summarize the narrative sections, while completely ignoring the others containing facts and figures that reflect the firm's annual financial performance. Excluding these key performance indicators from summaries limits their practical utility to stakeholders. Models trained on ECTSum, on the other hand, are specifically expected to capture salient financial metrics such as sales, revenues, current trends, etc., in as few words as possible.
Previously, Cardinaels et al. (2018) had attempted to summarize earnings calls using standard unsupervised approaches. We are, however, the first to propose and exhaustively benchmark a large-scale financial long document summarization dataset involving earnings call transcripts.

Dataset
This section describes our dataset, ECTSum, including the data sources and the steps taken to sanitize the data in order to obtain the document-summary pairs. Finally, we conduct an in-depth analysis of the dataset and report its statistics.

Data Sources
ECTs of listed companies are publicly hosted on The Motley Fool. We crawled the web pages corresponding to all available ECTs for the Russell 3000 Index companies posted between January 2019 and April 2022. In the process, we obtained a total of 7,389 ECTs. The HTML web pages were parsed using the BeautifulSoup library. ECTs typically consist of two broad sections: Prepared Remarks, where the company's financial results for the given reporting period are presented; and Question and Answers, where call participants ask questions regarding the presented results. We only consider the unstructured text corresponding to the Prepared Remarks section to form the source documents.
Collecting expert-written summaries corresponding to these ECTs was a far more challenging task. Reuters hosts a huge repository of financial news articles from around the world. Among these are articles, written by analysts, that summarize earnings call events in the form of a few bulleted points (see Table 1). After manually going through several such articles, and after consulting experts from Goldman Sachs, India, we understood that these articles precisely capture the key takeaways from earnings calls. Accordingly, using the company codes and dates of the earnings call events corresponding to the collected ECTs, we crawled Reuters web pages to search for relevant articles. We obtained 3,013 Reuters articles in the process.

Data Cleaning and Pairing
Cleaning the ECTs: Almost all earnings calls (and hence the corresponding transcripts) begin with an introduction by the call moderator/operator. We remove these statements since they do not relate to the financial results discussed thereafter. Some calls directly start with the Questions and Answers, in which case we exclude them from the collection.
Cleaning the summaries: For the Reuters (summary) articles, we first performed simple preprocessing to split the text into sentences. In many articles, we observed sentences ending with the phrase REFINITIV IBES DATA. Such sentences report estimates made by Refinitiv analysts on the earnings of publicly traded companies. We remove these sentences as they do not correspond to the actual results discussed in the earnings calls (as understood from our discussion with financial experts).
In the process, we make our target summaries factually consistent with the source documents.
Creating Document-Summary Pairs: In order to automate the process of pairing an ECT with its corresponding Reuters article, we first made sure that the article mentions the same company code as the ECT, and second, that it was posted either on the same day or at most one day after the earnings event. Please refer to Section A.1 for more details.
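This pairing rule can be sketched as a small predicate; the field names (`ticker`, `date`) are illustrative assumptions, not the paper's actual code:

```python
from datetime import date

def is_match(ect, article, max_lag_days=1):
    """Pair an ECT with a Reuters article if the company codes agree and the
    article appears on the same day or within one day of the earnings call."""
    same_company = ect["ticker"] == article["ticker"]
    lag = (article["date"] - ect["date"]).days  # days between call and article
    return same_company and 0 <= lag <= max_lag_days
```

A matched pair then becomes one document-summary instance in the dataset.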
After obtaining the automatically-matched pairs, the authors manually and independently cross-checked 200 randomly selected ECT(document)-Reuters(summary) pairs. We found all the pairs to be properly matched. The process thus ensures accuracy at the cost of obtaining a smaller amount of (sanitized) data. The dataset can however be easily extended as future earnings calls are covered by media houses such as Reuters and BusinessWire.

Statistics and Analysis
The data cleaning and pairing process described above resulted in a total of 2,425 document-summary pairs, with an average document length of around 2.9K words and an average target summary length of around 50 words. We randomly split the data to form the train (70%), validation (10%), and test (20%) sets. In Table 2, we report various dataset statistics, as defined by Grusky et al. (2018), for the ECTSum corpus and compare them with existing long document summarization datasets. While Coverage quantifies the extent to which a summary is derivative of a text, Density measures how well the word sequence of a summary can be described as a series of extractions.
Our scores of 0.85 (Coverage) and 2.43 (Density) are fairly comparable with those of other datasets. These indicate that although our target summary sentences are short abstractive texts, they are fairly derivable from the ECT content. Our document-to-summary compression ratio of 103.67 is overwhelmingly higher than that of any other dataset. This makes ECTSum challenging to work on, and requires models to be trained in a way that they capture relevant information in as few words as possible. Both these factors motivated the design of our proposed approach, ECT-BPS (refer to Section 4). Following prior works (Huang et al., 2021; Kryściński et al., 2021), we further assess whether the target summary content is confined to certain portions of the source document. For this, we plot, in Fig. 1, the percentage distribution of salient unigrams (target summary words excluding stopwords) in four equally sized segments of the source text. We observe that the salient content is evenly distributed across all four segments of the source documents. This property requires models trained on ECTSum to process the entire document in order to generate a high quality summary.
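Coverage, Density, and Compression are defined over the extractive fragments shared between summary and document (Grusky et al., 2018). The greedy matcher below is a simplified illustration of these statistics, not the reference implementation:

```python
def extractive_fragments(doc_tokens, sum_tokens):
    """Greedily find maximal token sequences of the summary that also
    occur in the document; returns the fragment lengths."""
    fragments, i = [], 0
    while i < len(sum_tokens):
        best = 0
        for j in range(len(doc_tokens)):
            k = 0
            while (i + k < len(sum_tokens) and j + k < len(doc_tokens)
                   and sum_tokens[i + k] == doc_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best:                 # a shared fragment starts at position i
            fragments.append(best)
            i += best
        else:                    # token not found in the document
            i += 1
    return fragments

def coverage_density_compression(doc_tokens, sum_tokens):
    frags = extractive_fragments(doc_tokens, sum_tokens)
    coverage = sum(frags) / len(sum_tokens)            # fraction of copied tokens
    density = sum(f * f for f in frags) / len(sum_tokens)  # avg squared fragment length
    compression = len(doc_tokens) / len(sum_tokens)    # document-to-summary ratio
    return coverage, density, compression
```

On ECTSum these statistics come out to roughly 0.85, 2.43, and 103.67, respectively.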

The ECT-BPS Framework
We observe some important properties of the Reuters reference summaries. They exhibit a high percentage of word overlap with the source ECT documents. However, they are not extractive; rather, they contain a small set of abstractive bullet points. It seems as if the analysts writing these summaries first selected some crucial parts of the ECT before compressing them into a bullet-point format. These properties of the reference summaries motivated us to design a two-stage pipeline approach for summarizing ECTs. Our proposed model ECT-BPS contains two separately trained modules/blocks: (1) an Extractive block that is trained to identify the most relevant sentences from the input ECT document, and (2) a Paraphrasing block that is trained to rephrase the extracted ECT sentences into the format of the target (Reuters) sentences, thereby generating a set of bullet points. Figure 2 gives an overview of our proposed architecture.

The Extractive Module
We leverage and suitably modify the architecture of SummaRuNNer (Nallapati et al., 2017) to design our extractive module. The vanilla SummaRuNNer consists of a two-layer bi-directional GRU-RNN. The first layer works at the word level to learn contextualized word representations, which are then average-pooled to obtain sentence representations. We replace this layer by FinBERT (Yang et al., 2020), a BERT model pre-trained on financial communication text, and use it to obtain the individual sentence representations. The second layer of bi-directional GRU-RNN works at the sentence level to learn contextualized representations of the input ECT sentences. We then obtain the document representation d using the hidden state vectors of sentences from this second layer of bi-directional GRU-RNN as follows:

d = tanh( W_d · (1/N_d) Σ_{i=1}^{N_d} [h_i^f ; h_i^b] + b )

where h_i^f and h_i^b respectively represent the hidden state vectors of the forward and backward GRUs corresponding to s_i, the i-th sentence of the input ECT document. W_d and b represent the weight and bias parameters, respectively. N_d represents the number of sentences in the document.
Each sentence s_i is sequentially revisited in a second pass, where a classification layer (Fig. 2) takes a binary decision regarding its inclusion in the summary as follows:

P(y_i = 1 | h_i, sum_i, d) = σ( W_c h_i + h_i^T W_s d − h_i^T W_r tanh(sum_i) + W_ap p_i^a + W_rp p_i^r + W_ν ν_i + b_c )

Here, sum_i represents the intermediate representation of the summary formed till s_i is visited. p_i^a and p_i^r respectively represent the absolute and relative positional embeddings corresponding to s_i. Please refer to Nallapati et al. (2017) for more details. We add a parameter ν_i that is set to 1 if s_i contains numerical values, and 0 otherwise. Keeping in mind the nature of the target summary sentences, which predominantly discuss metrics and numbers, ν_i guides the classifier to give higher weightage to sentences containing numerical values. Therefore, for each sentence s_i, its content f(h_i), its salience given the document context f(h_i, d), its novelty considering the summary already formed f(h_i, sum_i), its positional importance, and whether it contains monetary figures are all taken into account while deciding upon its summary membership.

The Paraphrasing Module
As depicted in Fig. 2, we fine-tune T5 (Raffel et al., 2020) to paraphrase the input ECT sentences into the telegram-style (Reuters) format of the target summary sentences. During this paraphrasing, special care is taken to ensure that the numerical values in the input sentences are not rephrased wrongly (hallucinated). More specifically, during training we replace the numerical values in the input sentences with placeholders such as [num-one], [num-two], etc. After obtaining the paraphrased sentences, we replace the placeholders with their original values in a simple post-processing step.
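The masking and post-processing steps might be sketched as follows; the number-matching regular expression and helper names are illustrative assumptions, not the paper's actual code:

```python
import re

PLACEHOLDERS = ["[num-one]", "[num-two]", "[num-three]", "[num-four]", "[num-five]"]
# Matches values like "$97.3", "4,200", or "9%" (an assumed pattern).
NUM_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")

def mask_numbers(sentence):
    """Replace each numerical value with an indexed placeholder before T5."""
    values = NUM_RE.findall(sentence)
    masked = sentence
    for value, slot in zip(values, PLACEHOLDERS):
        masked = masked.replace(value, slot, 1)
    return masked, values

def unmask_numbers(paraphrase, values):
    """Post-processing: restore the original values into the T5 output."""
    for value, slot in zip(values, PLACEHOLDERS):
        paraphrase = paraphrase.replace(slot, value)
    return paraphrase
```

Because T5 only ever sees and emits placeholders, the restored numbers are guaranteed to come from the source sentence.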

Training and Inference
Target Summary for Extractive Module. Corresponding to each sentence (hereby referred to as the 'target sentence') in the reference summary (obtained from Reuters), we first greedily search for a document sentence (using regular expressions) that captures all the numerical values mentioned in the target sentence. In case of multiple matches, we select all such document sentences. If no match is found, we select the document sentence that is most similar to the target sentence, in terms of cosine similarity between their embeddings obtained using Google's Universal Sentence Encoder (Cer et al., 2018). The selected set of document sentences serves as the target summary for training the Extractive Module. We train this module by minimizing the Binary Cross-Entropy loss between the predicted and the true sentence labels.
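A rough sketch of this label-construction heuristic is given below. For brevity it substitutes a bag-of-words cosine for the Universal Sentence Encoder embeddings, so it only illustrates the control flow, not the actual similarity model:

```python
import re
from collections import Counter
from math import sqrt

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def cosine(a, b):
    """Bag-of-words cosine (a stand-in for USE embedding similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def extractive_labels(doc_sents, summary_sents):
    """Binary labels marking document sentences that serve as extractive targets."""
    selected = set()
    for target in summary_sents:
        nums = set(NUM_RE.findall(target))
        # Greedy match: document sentences covering all numbers in the target.
        matches = [i for i, s in enumerate(doc_sents)
                   if nums and nums <= set(NUM_RE.findall(s))]
        if matches:
            selected.update(matches)
        else:  # fallback: the single most similar document sentence
            selected.add(max(range(len(doc_sents)),
                             key=lambda i: cosine(doc_sents[i], target)))
    return [1 if i in selected else 0 for i in range(len(doc_sents))]
```

These labels supervise the Binary Cross-Entropy training of the Extractive Module.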
For training the Paraphrasing Module, each sentence in the target summary for the Extractive Module becomes the source, while the corresponding reference summary sentence becomes the target. The module is trained by minimizing the Cross-Entropy loss between the predicted and target tokens.
During inference, a test ECT document is sent as input to the trained Extractive Module. The sentences of the extractive summary thus obtained are then paraphrased using the trained Paraphrasing Module to obtain the final summary.

Experiments and Results
In this section, we first enumerate the baselines and evaluation metrics. We then describe our experimental setup, followed by a detailed discussion of our main results. Next, we report the design and results of a human evaluation experiment conducted to manually assess and compare ECT-BPS-generated summaries with those of competing baselines. We end the section with a qualitative analysis of model-generated summaries.

Evaluation Metrics
For evaluating the content quality of model-generated summaries, we consider ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020). We report the F-1 scores corresponding to ROUGE-1, ROUGE-2, and ROUGE-L. For assessing the factual correctness of the generated summaries, we consider SummaC CONV (Laban et al., 2022), a recently proposed NLI-based factual inconsistency detection model.

Num-Prec.: Accurate reporting of monetary figures is crucial in the financial domain. However, quantity hallucination is a known problem in abstractive summaries (Zhao et al., 2020). In order to evaluate the correctness of values captured in summaries, especially the abstractive ones, we define Num-Prec. as the fraction of numerals/values in the model-generated summaries that appear in the source text. Please refer to Section A.3 for details.
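Under this definition, Num-Prec. might be computed roughly as below; the number-matching regex is an assumption, and the exact matching rules are given in Section A.3:

```python
import re

NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def num_prec(summary, source):
    """Fraction of numerical values in the summary that also appear in the source."""
    summary_nums = NUM_RE.findall(summary)
    if not summary_nums:
        return 1.0  # assumption: no numbers means nothing could be hallucinated
    source_nums = set(NUM_RE.findall(source))
    return sum(n in source_nums for n in summary_nums) / len(summary_nums)
```

Extractive summaries trivially score 1.0, since every value is copied verbatim from the source.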

Experimental Setup
As discussed in Section 4.3, we train the two modules of ECT-BPS separately. For training the extractive (and paraphrasing) modules, we initialize the FinBERT (and T5) parameters using pre-trained weights from Huggingface (Wolf et al., 2020). In the extractive module, all other parameters were set as defined in Nallapati et al. (2017). The Extractive (Paraphrasing) module is trained end-to-end with the Adam optimizer, a learning rate of 1e-5 (2e-5), and a batch size of 8 (16). Among the baselines, BART and Pegasus model parameters were initialized with weights pre-trained on financial data. For the others, the base versions of the respective models were used to initialize the parameter weights. All other model hyperparameters were initialized with default values as specified in the respective papers.
All models, including the ECT-BPS modules, were trained end-to-end with hyperparameters fine-tuned on the validation set (recall that we used a 70:10:20 train:validation:test split). In each case, the model with the lowest validation loss was used to evaluate the test set. All experiments were performed on a Tesla P100-PCIE (16GB) GPU. BART (1024), Pegasus (512), and T5 (512) have limitations on the length of input text that they can process. Since ECTs contain around 2.9K words on average, for training these abstractive methods we divided the source documents into multiple chunks, each with length less than or equal to the respective max_token_len. Corresponding target summaries were made by selecting the subset of target summary sentences that were entailed by the sentences in the document chunk under consideration. During inference, a short summary (max 32 tokens) was generated from each document chunk. The unique sentences from all such short summaries were concatenated to produce the overall summary for the entire document. Our ECTSum dataset and codes, including baselines, are publicly available on our GitHub repository.
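The chunking step can be sketched as a greedy packing of sentences into length-bounded chunks, assuming per-sentence token counts are available (function and variable names are illustrative):

```python
def chunk_document(sentences, token_lens, max_tokens):
    """Greedily pack consecutive sentences into chunks of at most max_tokens.
    A single sentence longer than max_tokens is kept whole in its own chunk."""
    chunks, current, current_len = [], [], 0
    for sent, n in zip(sentences, token_lens):
        if current and current_len + n > max_tokens:
            chunks.append(current)          # close the current chunk
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then summarized independently (max 32 tokens), and the unique output sentences are concatenated into the final summary.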

Main Results
Table 3 reports the performance of all competing methods on the test set. All the unsupervised methods perform poorly, thereby highlighting the domain-specific nature of the ECT summarization task, and hence the need for supervised training. Among the supervised extractive methods, MatchSum, a state-of-the-art extractive summarizer, has the best scores across all metrics. Here, we would like to highlight the advantage of the modifications we made to the vanilla SummaRuNNer. Our Extractive Module, ECT-BPS w/o Paraphrasing, when compared to SummaRuNNer, achieves an 18.7% improvement on average across all the ROUGE scores, and a 10.4% improvement in BERTScore. This also makes our Extractive Module the best performing extractive method across all metrics.
Please note that the Num-Prec. and SummaC CONV scores for all extractive summarizers are always 1.00, because the summary sentences are taken verbatim from the source documents. Among the abstractive methods, Pegasus and BART, despite being initialized with weights pre-trained on financial data, could not match the performance of T5. Interestingly, both T5 (0.508) as well as its long version, LongT5 (0.516), have very good factual consistency scores. These observations led us to select T5 as the backbone of our paraphrasing module. LED performs better on token overlap metrics (ROUGE and BERTScore) but has poor factual consistency scores, highlighting the issue of hallucination in abstractive summarizers (King et al., 2022). To conclude, despite the understandably good performance of long document summarizers on the ECT summarization task, our simple extract-then-paraphrase approach, ECT-BPS, establishes state-of-the-art performance with overall 6.8% better ROUGE scores, 3.67% better BERTScore, 8.5% better Num-Prec., and 0.4% better factual consistency scores over the respective strongest baselines.

Evaluation by Financial Domain Experts
Given the complex nuances of the financial domain, we had the model-generated summaries evaluated by a team of 10 analysts/experts working with the Goldman Sachs Data Science and Machine Learning Group, India, who were well-versed with the concepts of financial reporting, earnings calls, etc. For this, we created a survey with 75 randomly chosen test set ECTs and their corresponding summaries generated by ECT-BPS and LED, our strongest baseline. Each survey form was divided into 5 sections. In each section, the participants were required to go through an entire ECT (Motley Fool link provided), and evaluate the two summaries (randomly placed, identity not revealed) on three quality metrics - factual correctness, relevance, and coverage - as defined below: • Factual Correctness: For each summary sentence, the task was to assess if it can be supported by the source ECT. • Relevance: For each summary sentence, the task was to assess if it captures pertinent information relative to the ECT. The final correctness/relevance score of the summary is then determined based on the percentage of sentences that are factually correct/relevant, as follows: 5 (>80%), 4 (>60% & ≤80%), 3 (>40% & ≤60%), 2 (>20% & ≤40%), 1 (≤20%). Note that factual correctness is an objective metric, whereas relevance is a subjective metric. For Coverage, the participants were instructed to assign a score to the overall summary (on a Likert scale of 1-5) based upon their impression of the amount/coverage of relevant content present in it.
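The percentage-to-score mapping above can be written directly as a small function (a sketch of the stated thresholds, not code used in the study):

```python
def level_score(fraction):
    """Map the fraction of factually correct/relevant sentences to a 1-5 score."""
    for score, threshold in ((5, 0.8), (4, 0.6), (3, 0.4), (2, 0.2)):
        if fraction > threshold:
            return score
    return 1  # fraction <= 20%
```

Note that the thresholds are exclusive on the lower bound, so a summary with exactly 80% correct sentences scores 4, not 5.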
Participants were adequately remunerated for their involvement in the task. The results obtained from this survey are summarized in Table 4. At a summary/sample level, for 60% (45/75) and 59% (44/75) of the cases respectively, the summaries generated by ECT-BPS were found to contain more factually correct and relevant sentences than the corresponding LED-generated summaries. For 16% and 11% of the cases respectively, the scores for correctness and relevance were the same for both models. Also, 64% of the time, the participants found ECT-BPS-generated summaries to have broader coverage.
When we checked the results of individual experts, 70% of the participants (7 out of 10) found ECT-BPS-generated summaries to be better with respect to correctness and relevance. On the other hand, 8 out of 10 participants found ECT-BPS-generated summaries to have broader coverage.
The distribution of absolute scores assigned to the summaries is shown in Fig. 3 as a histogram plot. Here again we find that ECT-BPS-generated summaries are mostly scored ≥ 3 across all three metrics, whereas the majority of LED summaries are scored ≤ 3. Overall, the survey results were comprehensively in favor of ECT-BPS.

Qualitative Analysis
In Table 5, we qualitatively compare the summaries generated by LED and ECT-BPS corresponding to the earnings call transcript for FleetCor Technologies Inc Q2 2021 (https://tinyurl.com/mph93w46). The expert evaluation scores corresponding to this pair are also reported. We observe that LED wrongly produces a few monetary values, which makes the corresponding sentences factually incorrect, whereas ECT-BPS maintains the correctness of generated numbers. This may be attributed to our strategy of replacing numbers with placeholders while training the Paraphrasing Module (please refer to Section 4.2 for details). ECT-BPS however makes a factual error in the second sentence, where it misses the word adjusted. In the finance domain, adjusted earnings per share is different from earnings per share. These nuances necessitate further research on the ECTSum corpus, and on financial summarization in general.

Conclusion
To our knowledge, ECTSum is the first large-scale long document summarization dataset in the finance domain. Our documents consist of free-form lengthy transcripts of company earnings calls. Target summaries consist of a set of telegram-style bullet points derived from corresponding Reuters articles that cover the calls. Drawing observations from the nature of source transcripts and target summaries, we also propose a simple yet effective extract-then-paraphrase approach, ECT-BPS, that establishes state-of-the-art performance over strong summarization baselines across several metrics.
ECTSum is an extremely challenging dataset given its high document-to-summary compression ratio. Moreover, it is highly extendable, as future earnings calls are covered by media houses such as Reuters and BusinessWire. Finally, it is a very specialized dataset which would otherwise have cost a great deal of time and resources had one hired experts to write the reference summaries; the observation that these summaries are created by expert analysts and can be leveraged automatically is a major contribution of this paper. We believe our contributions to the dataset and methodology will attract future research in the finance domain.
• BERTScore (Zhang et al., 2020) aligns the generated and target summaries at the token level and uses BERT to compute their similarity scores. It correlates better with human judgements. We installed the latest version (0.3.11) of BERTScore from its official implementation, and calculated the scores with the recommended NLI model MICROSOFT/DEBERTA-XLARGE-MNLI.
• Num-Prec.: Accurate reporting of facts and monetary figures is crucial in the financial domain. Extractive summaries are always expected to contain values that appear in the source text. However, quantity/numeral hallucination is a known problem in abstractive summaries, which prior works (Zhao et al., 2020) have attempted to reduce. Here, we define Num-Prec. as the fraction of numerals/values in the model-generated summaries that are consistent with the source text. We use this metric to specifically evaluate the precision/correctness with which abstractive summarizers generate values.
• SummaC CONV (Laban et al., 2022) is a recently proposed NLI-based factual inconsistency detection model based on the aggregation of sentence-level entailment scores for each pair of input document and summary sentences. We used the official implementation of SummaC to obtain the scores for all model-generated summaries.

Figure 1: Salient unigram distribution in four equally sized segments of the source text. Higher percentages indicate higher unigram overlap. Percentages greater than 25 indicate repetitions.

Figure 2: ECT-BPS: Our Proposed Summarization Framework. It consists of an Extractive Module that is trained to select highly salient sentences from the source document. The Paraphrasing Module is then trained to paraphrase the ECT sentences to the (Reuters) format of target summary sentences.

Figure 3: Histogram distribution for human evaluation scores assigned to model-generated summaries.

Table 1:
ECTSum: Excerpt from the Reuters article corresponding to the ECT for Apple Q2 2022.

Table 2:
Comparing the statistics of the ECTSum dataset with existing long document summarization datasets. The numbers for the datasets marked with * are copied from Kryściński et al. (2021), whereas the ones marked with † are copied from Huang et al. (2021). Numbers which were not reported are left blank. ECTSum has the highest compression ratio among all the datasets while having comparable coverage and density scores.

Table 3:
Comparison of representative summarizers against automatic evaluation metrics. Best scores are bolded.

Table 4:
Results for the manual evaluation of model-generated summaries by a team of 10 financial experts.

Table 5:
Comparing the summaries generated by LED and ECT-BPS for a given ECT (details in Section 5.6). Parts marked in red are wrongly generated. ECT-BPS better preserves the correctness of generated numbers.