Bringing Structure into Summaries: a Faceted Summarization Dataset for Long Scientific Documents

Faceted summarization provides briefings of a document from different perspectives. Readers can quickly comprehend the main points of a long document with the help of a structured outline. However, little research has been conducted on this subject, partially due to the lack of large-scale faceted summarization datasets. In this study, we present FacetSum, a faceted summarization benchmark built on Emerald journal articles, covering a diverse range of domains. Different from traditional document-summary pairs, FacetSum provides multiple summaries, each targeted at specific sections of a long document, including the purpose, method, findings, and value. Analyses and empirical results on our dataset reveal the importance of bringing structure into summaries. We believe FacetSum will spur further advances in summarization research and foster the development of NLP systems that can leverage the structured information in both long texts and summaries.


Introduction
Text summarization is the task of condensing a long piece of text into a short summary without losing salient information. Research has shown that a well-structured summary can effectively facilitate comprehension (Hartley et al., 1996; Hartley and Sydes, 1997). A case in point is the structured abstract, which consists of multiple segments, each focusing on a specific facet of a scientific publication (Hartley, 2014), such as background, method, conclusions, etc. The structure therein can provide much additional clarity for improved comprehension and has long been adopted by databases and publishers such as MEDLINE and Emerald.
Despite these evident benefits of structure, summaries are often framed as a linear, structure-less sequence of sentences in the flourishing array of summarization studies (Nallapati et al., 2017; See

Title
Emotion in enterprise social media systems

Purpose
The purpose of this paper is to investigate enterprise social media systems and quantified gender and status influences on emotional content presented in these systems.

Method
Internal social media messages were collected from a global software company running an enterprise social media system. An indirect observatory test using Berlo's "source-message-channel-receiver" model served as a framework to evaluate sender, message, channel and receiver for each text. These texts were categorized by gender and status using text analytics with SAP SA to produce sentiment indications.

Findings
Results reveal women use positive language 2.1 times more than men. Senior managers express positive language 1.7 times more than non-managers, and feeling rules affect all genders and statuses, but not necessarily as predicted by theory. Other findings show that public messages contained less emotional content, and women expressed more positivity to lower status colleagues. Men expressed more positivity to those in higher positions. Many gender and status stereotypes found in face-to-face studies are also present in digital enterprise social networks.

Value
This study offers a behavioral measurement approach free from validity issues found in self-reported surveys, direct observations and interviews. The collected data offered new perspectives on existing social theories within a new environment of computerized, enterprise social media.
We aim to address this issue by proposing the FacetSum dataset. It consists of 60,024 scientific articles collected from Emerald journals, each associated with a structured abstract that summarizes the article from distinct aspects including purpose, method, findings, and value. Scale-wise, we empirically show that the dataset is sufficient for training large-scale neural generation models such as BART (Lewis et al., 2020) for adequate generalization. In terms of quality, each structured abstract in FacetSum is provided by the original author(s) of the article, who are arguably in the best position to summarize their own work. We also provide quantitative analyses and baseline performances on the dataset with mainstream models in Sections 2 and 3.

FacetSum for Faceted Summarization
The FacetSum dataset is sourced from journal articles published by Emerald Publishing (Figure 1). Unlike many publishers, Emerald imposes explicit requirements that authors summarize their work from multiple aspects (Emerald, 2021): Purpose describes the motivation, objective, and relevance of the research; Method enumerates specific measures taken to reach the objective, such as experiment design, tools, methods, protocols, and datasets used in the study; Findings present major results such as answers to the research questions and confirmation of hypotheses; and Value highlights the work's value and originality. Together, these facets give rise to a comprehensive and informative structure in the abstracts of the Emerald articles, and by extension, to FacetSum's unique ability to support faceted summarization.

General Statistics
We collect 60,532 publications from Emerald Publishing spanning 25 domains. From a summarization perspective, these differences imply that FacetSum may pose significantly increased modeling and computation challenges due to the increased lengths in both the source and the target. Moreover, the wide range of research domains (Figure 3, Appendix D) may also introduce considerable linguistic diversity w.r.t. vocabulary, style, and discourse. Therefore, compared to existing scientific publication datasets that only focus on specific academic disciplines (Cohan et al., 2018; Cachola et al., 2020), FacetSum can also be used to assess a model's robustness to domain shift and its systematic generalization.
To facilitate assessment of generalization, we reserve a dev and a test set each consisting of 6,000 randomly sampled data points; the remaining data are intended as the training set. We ensure that the domain distribution is consistent across all three subsets. In addition, we intentionally leave out Open-Access papers as another test set, to accommodate researchers who do not have full Emerald access.
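As an illustration of how a split with a consistent domain distribution could be produced, the following is a minimal sketch of proportional (stratified) sampling. It is not the procedure actually used to build FacetSum: the function name `stratified_split`, its parameters, and the rounding scheme are assumptions made for this example, and because of rounding the resulting dev/test sizes may deviate slightly from the requested sizes.

```python
import random
from collections import defaultdict


def stratified_split(domains, dev_size, test_size, seed=0):
    """Split item indices into train/dev/test with roughly equal domain proportions.

    `domains[i]` is the domain label of item i (hypothetical input format).
    Returns three lists of indices: (train, dev, test).
    """
    rng = random.Random(seed)

    # Group item indices by their domain label.
    by_domain = defaultdict(list)
    for idx, dom in enumerate(domains):
        by_domain[dom].append(idx)

    total = len(domains)
    train, dev, test = [], [], []
    for dom in sorted(by_domain):
        items = by_domain[dom]
        rng.shuffle(items)
        # Each domain contributes to dev/test in proportion to its share of the data.
        n_dev = round(dev_size * len(items) / total)
        n_test = round(test_size * len(items) / total)
        dev.extend(items[:n_dev])
        test.extend(items[n_dev:n_dev + n_test])
        train.extend(items[n_dev + n_test:])
    return train, dev, test
```

Sampling within each domain separately, rather than from the pooled data, is what keeps the domain proportions of the three subsets close to those of the full collection.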

Structural Alignment
In this section, we focus our analysis on one of the defining features of FacetSum: its potential to support faceted summarization. Specifically, we investigate how the abstract structure (i.e., facets) aligns with the article structure. Given an abstract facet $A$ and its corresponding article $S$, we quantify this alignment by:

$$S_A = \{\arg\max_{i \in \{1, \dots, |S|\}} \text{Rouge-1}(s_i, a_j) : a_j \in A\} \quad (1)$$

Semantically, $S_A$ consists of sentence indices in $S$ that best align with each sentence $a_j$ in $A$.

Sentence-level Alignment. We first plot the tuples $\{(s_i, i/|S|) : i \in S_A\}$, where $s_i$ is the $i$-th sentence in $S$, and $|S|$ is the number of sentences in $S$. Intuitively, the plot density around position $i/|S|$ indicates the degree of alignment between the facet $A$ and the article $S$ at that position. With 10,000 articles randomly sampled from FacetSum, Figure 2 exhibits distinct differences in the density distribution among the facets in FacetSum. For example, with $A$ = Purpose, resemblance is clearly skewed towards the beginning of the articles, while Findings are mostly positioned towards the end; the Method distribution is noticeably more uniform than the others. These patterns align well with intuition, and are further exemplified by the accompanying density histograms.

Figure 2: Oracle sentence distribution over a paper. X-axis: 10,000 papers sampled from FacetSum, sorted by full text length from long to short; y-axis: normalized position in a paper. We provide each sub-figure's density histogram on its right.

Section-level Alignment. We now demonstrate how different abstract facets align with different sections in an article. Following the conventional structure of scientific publications (Suppe, 1998; Rosenfeldt et al., 2000), we first classify sections into Introduction, Method, Result and Conclusion using keyword matching in the section titles. Given a section $S_i \subseteq S$ and an abstract $A_j \subseteq A$, we define the section-level alignment $g(S_i, A_j)$ as $\text{Rouge-1}(\mathrm{cat}(S_i^{A_j}), \mathrm{cat}(A_j))$, where $\mathrm{cat}(\cdot)$ denotes sentence concatenation, and $S_i^{A_j}$ is defined by Equation (1).
Table 2 is populated by varying A j and S i across the rows and columns, respectively. Full denotes the full paper or abstract (concatenation of all facets). We also include the concatenation of introduction and conclusion (denoted I+C) as a possible value for S i , due to its demonstrated effectiveness as summaries in prior work (Cachola et al., 2020).
The larger numbers on the diagonal (in red) empirically confirm a strong alignment between FacetSum facets and their sectional counterparts in articles. We also observe a significant performance gap between using I+C and the full paper as S i . One possible reason is that the summaries in FacetSum (particularly Method and Findings) may contain more detailed information beyond introduction and conclusion. This suggests that for some facets in FacetSum, simple tricks to condense full articles do not always work; models need to instead comprehend and retrieve relevant texts from full articles in a more sophisticated manner.
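The sentence-level alignment analysis in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a simplified unigram-overlap F1 on whitespace tokens stands in for the scoring function, indices are 0-based (so normalized positions fall in [0, 1)), the article is assumed to be non-empty, and the helper names `unigram_f1`, `align_facet`, and `normalized_positions` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified unigram-overlap F1 between two sentences (whitespace tokens)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def align_facet(article_sentences, facet_sentences):
    """Return the set of article sentence indices that best match each facet sentence."""
    aligned = set()
    for a in facet_sentences:
        scores = [unigram_f1(s, a) for s in article_sentences]
        # Index of the highest-scoring article sentence (first one on ties).
        aligned.add(max(range(len(scores)), key=scores.__getitem__))
    return aligned


def normalized_positions(aligned, num_sentences):
    """Map aligned sentence indices to relative positions within the article."""
    return sorted(i / num_sentences for i in aligned)
```

Collecting `normalized_positions` over many articles for a single facet yields the kind of position distribution that a density plot over the article length would summarize.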

Experiments and Results
We use FacetSum to benchmark a variety of summarization models, from state-of-the-art supervised models to unsupervised and heuristics-based models. We also provide the scores of a sentence-level extractive oracle system (Nallapati et al., 2017). We report Rouge-L in this section and include Rouge-1/2 results in Appendix E.

Unsupervised Models vs Heuristics. We report the performance of unsupervised and heuristic summarization methods (see Table 3). To tailor these methods to the task of generating a summary for a specific facet, we only use the section (defined in Section 2.2) corresponding to a facet as model input. Evaluation is also performed on the concatenation of all facets (column Full), which resembles the traditional research abstract. Lead-K/Tail-K are two heuristic-based models that extract the first/last k sentences from the source text. We observe that heuristic models do not perform well on Full, where the unsupervised models can achieve decent performance. Nevertheless, all models perform poorly on summarizing individual facets, and unsupervised models fail to consistently outperform simple heuristics. The inductive biases of those models may not be good indicators of summary sentences on specific facets. A possible reason is that they are good at locating the overall important sentences of a document, but they cannot differentiate the sentences of each facet, even though we try to alleviate this by using the corresponding section as input.

Supervised Models. As for the supervised baseline, we adopt the BART model (Lewis et al., 2020), which has recently achieved SOTA performance on abstractive summarization tasks with scientific articles (Cachola et al., 2020). We propose two training strategies for the BART model, adapting it to handle the unique challenge of faceted summarization in FacetSum.
In BART, we train the model to generate the concatenation of all facets, joined by special tokens that indicate the start of a specific facet (e.g., |PURPOSE| to indicate the start of Purpose summary). During evaluation, the generated text is split into multiple facets based on the special tokens, and each facet is compared against the corresponding ground-truth summary.
In BART-Facet, we train the model to generate one specific facet given the source text and an indicator that specifies which facet to generate. Inspired by CATTS (Cachola et al., 2020), we prepend section tags at the beginning of each training input to generate summaries for a particular facet (see implementation details in Appendix C).
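The data formatting behind the two training strategies can be sketched as below. Only the `|PURPOSE|` token is given in the text; the other token names, the exact tag format prepended for BART-Facet, and all function names here are assumptions made for illustration rather than the authors' actual preprocessing.

```python
import re

# Facet order follows the structured abstract; token names other than
# |PURPOSE| are assumed by analogy.
FACETS = ["purpose", "method", "findings", "value"]


def facet_token(facet):
    """Special token marking the start of a facet, e.g. |PURPOSE|."""
    return f"|{facet.upper()}|"


def build_joint_target(summaries):
    """BART setup: concatenate all facet summaries, each preceded by its token."""
    return " ".join(
        f"{facet_token(f)} {summaries[f]}" for f in FACETS if f in summaries
    )


def split_joint_output(text):
    """Split generated text back into facets based on the special tokens."""
    pattern = "(" + "|".join(re.escape(facet_token(f)) for f in FACETS) + ")"
    # With a capturing group, re.split returns [prefix, token, body, token, body, ...].
    parts = re.split(pattern, text)
    result = {}
    for token, body in zip(parts[1::2], parts[2::2]):
        facet = token.strip("|").lower()
        # Keep the first occurrence if the model repeats a token.
        result.setdefault(facet, body.strip())
    return result


def build_facet_input(source, facet):
    """BART-Facet setup: prepend a facet tag to the source text (assumed format)."""
    return f"{facet_token(facet)} {source}"
```

Under this sketch, a facet whose token never appears in the generated text is simply absent from the dictionary returned by `split_joint_output`, so an evaluation script would need to decide how to score such missing facets.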
Empirically, supervised models outperform unsupervised baselines by a large margin (Table 3). Comparing the two training strategies, BART-Facet outperforms BART significantly. While BART performs comparably on Purpose, its performance decreases drastically for subsequent facets, possibly due to current models' inadequacy with long targets. It performs decently at the beginning of generation (≈40 on Purpose), where the dependency is relatively easy to handle, but the output quality degrades quickly towards the end (≈5 on Value).
With I+C as source text, both training strategies exhibit much better results than with the full paper. This is the opposite of the observation in Table 2, potentially due to a limitation of current NLG systems, i.e., the length of the source text has a crucial impact on model performance. Given the much extended positional embeddings in our models (10,000 tokens), we suspect that other issues such as long-term dependencies may lead to this discrepancy, which warrants further investigation.
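For reference, the Lead-K and Tail-K heuristic baselines described earlier in this section amount to taking the first or last k sentences of the input. A minimal sketch, assuming the source text has already been split into a list of sentences (the function names are illustrative):

```python
def lead_k(sentences, k):
    """Lead-K: extract the first k sentences of the source text."""
    return sentences[:k]


def tail_k(sentences, k):
    """Tail-K: extract the last k sentences of the source text."""
    # Guard k == 0, since sentences[-0:] would return the whole list.
    return sentences[-k:] if k > 0 else []
```

Both functions return the full input when k exceeds the number of available sentences.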

Conclusion
We introduce FacetSum to support the research of faceted summarization, which targets summarizing scientific documents from multiple facets. We provide extensive analyses and results to investigate the characteristics of FacetSum. Our observations call for the development of models capable of handling very long documents and outputting controlled text. Specifically, we will consider exploring the following topics in future work: (1) incorporating methods for long-document processing, such as reducing input length by extracting key sentences (Pilault et al., 2020)

C Implementation Details
To make BART take the full text as input, we extend the positional embeddings to 10,000 tokens. This is required to leverage the long text of papers in FacetSum, which have an average length of 6,000 words. Experiments with unsupervised baselines are implemented with Sumy (Belica, 2021) and the official code of HipoRank. We tune the hyperparameters of HipoRank on the validation set. The BART models are finetuned using Fairseq (Ott et al., 2019), with a learning rate of 3e-5, a batch size of 1, a maximum of 10,000 tokens per batch, and an update frequency of 4. We finetune all models for 20,000 steps on a single NVIDIA Tesla V100 16GB GPU and report the results of the last checkpoint. The small batch size is a consequence of the large input size. For inference, we use a beam size of 4 and a maximum length of 500/200 tokens for BART/BART-Facet, respectively.

D Domains Covered by FacetSum
In Figure 3, we show the distribution of domain categories in FacetSum.

E Full Results
In this section, we provide additional experiment results. In Table 6, we show the full results of the extractive oracle system (first row in Table 3). In Table 7, we provide full results of all other models (heuristic models, unsupervised models, and supervised models in Table 3).

F Example of Outputs by BART and BART-Facet
Purpose
Ground truth: The purpose of this paper is to advance the information society research by examining and better understanding the impact of the adoption information and communication technologies (ICT) within households on improving sustainability.
BART: The purpose of this paper is to answer the question: Does the adoption of information and communication technologies (ICT) within households contribute to sustainability? For this purpose, the study focuses on addressing the following specific questions: What criteria clearly and deeply describe the ICT adoption within households? What are the criteria that clearly describe sustainability within households, and are there statistically significant correlations between ICT and sustainability?
BART-Facet: The purpose of this paper is to answer the question: Does the adoption of information and communication technologies (ICT) within households contribute to sustainability?

Method
Ground truth: The survey questionnaire was used and data collected from 679 Polish households were statistically analyzed to understand the phenomena of ICT adoption and sustainability as well as identify correlations between them.
BART: A survey was carried out in Poland. The data were collected through a structured questionnaire and analyzed using structural equation modeling (SEM).
BART-Facet: The study is based on a quantitative research design. The data were collected through a questionnaire survey among the households in Poland. The survey was carried out in two phases. In the first phase, the ICT adoption within households was measured using a structured questionnaire, and in the second phase the data were analyzed using structural equation modeling (SEM) to test the hypothesized relationships.

Findings
Ground truth: The research findings reveal that the ICT adoption is well described by the ICT outlay, information culture, ICT management and ICT quality, whereas sustainability is composed of ecological, economic, socio-cultural and political sustainability. Furthermore, the ICT quality, ICT management and information culture have a significant impact on sustainability, whereas the ICT outlay does not have such an impact.
BART: Results show that ICT is a key enabler of sustainability in households. The results also show that there are statistically significant correlation between the IIT adoption within the households and sustainability.
BART-Facet: The results show that the adoption of ICT within households is positively related to sustainability. The results also show that there are statistically significant correlations between the ICT adoption within households and sustainability.

Value
Ground truth: The paper provides and verifies a new theoretical model of sustainable information society to depict various dimensions shaping the ICT adoption and their impact on different types of sustainability in the context of households.
BART: This study is the first to empirically investigate the impact of ICT on sustainability. The findings of this study will be complementary with findings concerning the contribution of IIT to sustainability in enterprises and allow for the advancement in the sustainable information society (SIS) research.
BART-Facet: This study contributes to the literature by providing a deeper understanding of the ICT adoption within households and the contribution of ICT to sustainability in transition economies, i.e. the former European Eastern Bloc countries.

BART-Facet provides more information on Method and fewer errors than BART (e.g., "IIT" is a typo of "ICT"). However, both models tend to directly copy text from the source; for example, both outputs for Purpose can be found in the introduction of the paper.