Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature

Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming to rectify these issues, we present two novel lay summarisation datasets, PLOS (large-scale) and eLife (medium-scale), each of which contains biomedical journal articles alongside expert-written lay summaries. We provide a thorough characterisation of our lay summaries, highlighting differing levels of readability and abstractiveness between datasets that can be leveraged to support the needs of different applications. Finally, we benchmark our datasets using mainstream summarisation approaches and perform a manual evaluation with domain experts, demonstrating their utility and casting light on the key challenges of this task.


Introduction
Scientific publications contain information that is essential for the preservation and progression of our understanding across all scientific disciplines. Typically being highly technical in nature, such articles tend to assume a degree of background knowledge and make use of domain-specific language, making them difficult to comprehend for one lacking the required expertise (i.e., a lay person). These factors often limit the impact of research to only its direct community (Albert et al., 2015, 2022) and, more dangerously, can cause readers (members of the public, journalists, etc.) to misinterpret research findings (Kuehne and Olden, 2015). This latter point is especially important for biomedical research which, in addition to having particularly dynamic and confusing terminology (Smith, 2006; Peng et al., 2021), has the potential to directly impact people's decision-making regarding health-related issues, with a pertinent example of this being the widespread misinformation seen during the COVID-19 pandemic (Islam et al., 2020). Aiming to address these challenges, some academic journals choose to publish lay summaries that clearly and concisely explain the context and significance of an article using non-specialist language. Figure 1 illustrates how simplifying jargon (e.g., "SARS-CoV-2" → "the virus that causes COVID-19") and focusing on background information allows a reader to better understand a complex scientific topic. However, in addition to placing an extra burden on authors, lay summaries are not yet ubiquitous and focus only on newly published articles.

Technical Abstract
The virus SARS-CoV-2 can exploit biological vulnerabilities (e.g. host proteins) in susceptible hosts that predispose to the development of severe COVID-19. To identify host proteins that may contribute to the risk of severe COVID-19, we undertook proteome-wide genetic colocalisation tests, and polygenic (pan) and cis-Mendelian randomisation analyses leveraging publicly available protein and COVID-19 datasets...

Lay Summary
Individuals who become infected with the virus that causes COVID-19 can experience a wide variety of symptoms. These can range from no symptoms or minor symptoms to severe illness and death. Key demographic factors, such as age, gender and race, are known to affect how susceptible an individual is to infection. However, molecular factors, such as unique gene mutations and gene expression levels can also have a major impact on patient responses by affecting the levels of proteins in the body...

Figure 1: The first few sentences of the abstract and lay summary of an eLife article, illustrating differences in the language and focus on background information.

Automatic text summarisation can provide significant value in the generation of scientific lay summaries. Previous use of summarisation techniques for scientific articles has largely focused on generating a technical summary (e.g., the abstract), and only a few works have addressed the task of lay summarisation and introduced datasets to facilitate its study (Chandrasekaran et al., 2020; Guo et al., 2021; Zaman et al., 2020). However, compared to datasets ordinarily used for training supervised summarisation models, these resources are relatively small (ranging from 572 to 6,695 articles), presenting a significant barrier to the deployment of large data-driven approaches that require training on large amounts of parallel data. Furthermore, these resources are somewhat fragmented in terms of their framing of the task, making use of article and summary formats that limit their applicability to broader biomedical literature. These factors hinder the progression of the field and the development of usable models that can make scientific content accessible to a wider audience.
To help alleviate these issues, we introduce two new datasets derived from different academic journals within the biomedical domain: PLOS and eLife (§3). Both datasets use the full journal article as the source, enabling the training of models which can be broadly applied to wider literature. PLOS is significantly larger than currently available datasets and makes use of short author-written lay summaries (150-200 words), whereas eLife's summaries are approximately twice as long and written by expert editors who are well-practiced in the simplification of scientific content. Given these differences in authorship and length, we expect the lay summaries of eLife to simplify content to a greater extent, meaning our datasets are able to cater to different audiences and applications (e.g., personalised lay summarisation). We confirm this via an in-depth characterisation of the lay summaries within each dataset, quantifying ways in which they differ from the technical abstract and from each other (§4). Finally, we benchmark our datasets with popular summarisation approaches using automatic metrics and conduct an expert-based manual evaluation, highlighting the utility of our datasets and key challenges for the task of lay summarisation (§5). This paper also presents a literature review (§2), conclusions (§6), and a discussion on its limitations (§7).

Related Work
Past attempts to automatically summarise scientific content in layman's terms have been scarce, with the most prominent example being the LaySumm subtask of the CL-SciSumm 2020 shared task series (Chandrasekaran et al., 2020) which attracted a total of 8 submissions. Alongside the task, a training corpus of 572 articles and author-generated lay summaries from a multi-disciplinary collection of Elsevier-published scientific journals was provided, with submissions being evaluated on a blind test set of 37 articles. It was noted by the task organisers that the data provided was insufficient for training a model to produce a realistic lay summary. Guo et al. (2021) also make use of a single publication source to retrieve lay summaries: The Cochrane Database of Systematic Reviews (CDSR). Their dataset contains the abstracts of 6,695 systematic reviews paired with their respective plain-language summaries, covering various healthcare domains. Although larger than other available datasets for lay summarisation, CDSR is constrained in that it only uses the abstracts of systematic reviews as source documents, and thus models trained using CDSR will be unlikely to generalise well to inputs that are longer than an abstract or the abstracts of other types of publication.
Alternatively, Zaman et al. (2020) introduce a dataset derived from the 'Eureka-Alert' science news website for the combined tasks of simplification and summarisation. Summaries consist of news articles (average length > 600 words) that aim to describe the content of a scientific publication to the non-expert. However, the extensive size of reference summaries is likely to present additional challenges in model training and their news-based format limits their applicability (e.g., in automating lay summarisation for journals).
Compared to previous resources, our datasets contain articles and lay summaries of a format that we consider to be more broadly applicable to wider literature. Additionally, PLOS is significantly larger than those currently available (over 4× larger than CDSR) and eLife contains summaries written by expert editors. Furthermore, our work is the first to provide two datasets with different levels of readability, thus supporting the needs of different audiences and applications. Through each of these factors, we hope to enable the creation of more usable lay summarisation models.

Our Datasets
We introduce two datasets from different biomedical journals (PLOS and eLife), each containing full scientific articles paired with manually-created lay summaries. For each data source, articles were retrieved in XML format and parsed using Python to retrieve the lay summary, abstract, and article text. In line with previous datasets for scientific summarisation (Cohan et al., 2018), the article text is separated into sections, and the heading of each section is also retrieved. Sentences are segmented using the PySBD rule-based parser (Sadvilkar and Neumann, 2020), which we empirically found to outperform neural alternatives. We separate our datasets into training, validation, and testing splits at a ratio of 90%/5%/5%. Statistics describing the contents of our datasets and that of past lay summarisation datasets are given in Table 1.

PLOS
… Computational Biology, Genetics, Pathogens, and Neglected Tropical Diseases.

eLife
eLife is an open-access peer-reviewed journal with a specific focus on biomedical and life sciences. Of the articles published in eLife, some are selected to be the subject of a digest, a simplified summary of the work written by expert editors based on both the article itself and questions answered by its author. Similarly to PLOS, these digests aim to explain the background and significance of a scientific article in language that is accessible to non-experts (King et al., 2017).

Dataset Analysis
We carry out several analyses comparing the lay summaries of our datasets to the respective technical abstracts. Through these analyses, we seek to highlight and quantify the key differences between these two different types of summary, as well as those present between the lay summaries of our two datasets. Specifically, we focus on readability (§4.1), rhetorical structure (§4.2), vocabulary sharing (§4.3), and abstractiveness (§4.4).

Readability
We assess the readability of our lay summaries and abstracts using several established metrics. Specifically, we employ Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), Dale-Chall Readability Score (DCRS), and WordRank score. FKGL, CLI, and DCRS provide an approximation of the (US) grade level of education required to read a given text. The formula for FKGL is based on the total number of sentences, words, and syllables present within the text, whereas CLI is based on the number of sentences, words, and characters. Alternatively, DCRS measures readability using the average sentence length and the number of familiar words present, using a lookup table of the 3,000 most commonly used English words. Similarly, WordRank estimates the lexical complexity of a text based on how common the language is, using a frequency table derived from English Wikipedia.
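To make the two count-based metrics concrete, the following is a minimal sketch of the standard FKGL and CLI formulas. It is not the implementation used for Table 2 (the textstat package is used for that purpose); the function names and the naive regular-expression tokenisation are assumptions made purely for illustration.

```python
import re


def fkgl(n_sentences: int, n_words: int, n_syllables: int) -> float:
    """Flesch-Kincaid Grade Level computed from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def cli(text: str) -> float:
    """Coleman-Liau Index; tokenisation here is a naive assumption."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    letters = sum(len(word) for word in words)
    letters_per_100_words = 100.0 * letters / len(words)
    sentences_per_100_words = 100.0 * len(sentences) / len(words)
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8
```

For both functions a lower value indicates text that should be readable at a lower grade level, which is the direction of improvement discussed in this section.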
The scores given in Table 2 show that the lay summaries of both datasets are consistently more readable than their respective abstracts across all metrics. Although these differences are small in some cases, in line with the findings of previous works (Devaraj et al., 2021), we find them all to be statistically significant by way of Mann-Whitney U tests (p < 0.05). These results indicate that lay summaries are more readable than technical abstracts in terms of both syntactic structure and lexical intelligibility. Additionally, the lay summaries from eLife obtain lower readability scores than those of PLOS across all metrics, confirming our expectation that they are suitable for less technical audiences.

Rhetorical Structure
Rhetoric is another important factor when assessing the comprehensibility of a text. Specifically, a lay person will require a much larger focus on the background of a scientific article than an expert in order to understand the significance of its findings (King et al., 2017), thus we would expect lay summaries to focus more on such aspects.
To provide further insight into the structural differences between abstracts and lay summaries, we classify all sentences within each based on their rhetorical status. To do this, we make use of PubMed RCT (Dernoncourt and Lee, 2017), a dataset containing 20,000 biomedical abstracts retrieved from PubMed, with each sentence labelled according to its rhetorical role (roles: Background, Objective, Methods, Results, Conclusions). We use PubMed RCT to train the BERT-based sequential classifier introduced by Cohan et al. (2019) due to its strong reported performance (92.9 micro F1-score), before applying this model to lay summary and abstract sentences from our datasets.
Figure 2 provides a visualisation of how the frequency of each rhetorical class changes according to the sentence position within our summaries. For each sub-graph, observing the pattern of most frequent labels (tallest bars) across all positions allows us to get an idea of the dominant rhetorical structure. In Table 3, we further quantify the difference in structure by giving the average percentage of each label present in the different summaries.
For both datasets, we see a similar pattern when comparing abstract and lay summary distributions. Specifically, a much greater portion of lay summary sentences is dedicated to explaining the relevant background information ("Background"). This is unsurprising, as such information is essential to understanding the motivation and significance of any work and, thus, would be of great value to a non-expert. This additional focus on "Background" comes at the expense of sentences focusing on "Results" and (to a lesser extent) "Methods", which are less frequent within lay summaries. Again, this is to be expected, as these details are less meaningful to an audience without domain expertise.
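As an illustration of the kind of aggregation summarised in Table 3, the sketch below averages the per-summary percentage of each rhetorical label. It assumes that every sentence has already been assigned a label by the classifier; the function name and input layout (one list of labels per summary) are hypothetical, and whether the reported figures average per summary in exactly this way is itself an assumption.

```python
from collections import Counter

LABELS = ["Background", "Objective", "Methods", "Results", "Conclusions"]


def average_label_percentages(summaries):
    """Average, over summaries, the percentage of sentences carrying each label.

    `summaries` is a list in which each item is the list of predicted
    sentence labels for one summary.
    """
    totals = {label: 0.0 for label in LABELS}
    for labels in summaries:
        counts = Counter(labels)
        for label in LABELS:
            totals[label] += 100.0 * counts[label] / len(labels)
    return {label: totals[label] / len(summaries) for label in LABELS}
```

Running the same function separately over abstracts and lay summaries would give two distributions that can be compared label by label.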

Content Words
Aiming to determine what terminology is shared between summary types, we analyse the frequency at which content words occur simultaneously within abstracts and lay summaries. We treat nouns, proper nouns, verbs, and numbers as content words, and we extract these from the summaries using ScispaCy (Neumann et al., 2019), a library that specialises in the processing of biomedical texts. Figure 3 shows the results of our analysis, visual- … typically occur in a greater number of abstracts (most commonly being present within 2-10). For content words of all types, we observe that the ratio of 'shared' to 'not shared' generally increases in line with the number of abstract occurrences (Table 7 in Appendix A gives the exact percentages shown in Figure 3).

Abstractiveness
We follow the example of prior works (Sharma et al., 2019; See et al., 2017) by calculating the abstractiveness of our summaries using n-gram novelty, thus providing a measurement of the degree to which the summary uses different language to describe the content of the article. Specifically, for both abstracts and lay summaries, we compute the percentage of summary n-grams which are absent from their respective article. The results of this analysis are presented in Figure 4, where we can observe that lay summaries consistently contain more novel n-grams than abstracts across both datasets. However, the lay summaries of eLife, in addition to being approximately twice as long as those of PLOS (Table 1), appear to be significantly more abstractive. Alongside differences in readability (highlighted in §4.1), we believe these to be important distinctions that should be considered in determining a suitable dataset for a particular use case or application.
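The n-gram novelty measure can be sketched as follows. This is an illustrative version only: it assumes pre-tokenised input and counts unique n-grams, neither of which is specified above, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(summary_tokens, article_tokens, n):
    """Percentage of unique summary n-grams that never appear in the article."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(article_tokens, n)
    return 100.0 * len(novel) / len(summary_ngrams)
```

Higher values for larger n indicate that a summary rephrases the article rather than copying spans from it.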

Experiments and Results
To help facilitate future work, we benchmark our datasets using popular heuristics-based, unsupervised, and supervised summarisation approaches (§5.1). Additionally, we provide further insight into these results via a detailed discussion (§5.2) and an expert-based manual evaluation (§5.3).

Baseline Approaches
For our heuristics-based approaches, we include the widely-used LEAD-3 baseline which simply uses the first three sentences of the main body of the text. As our lay summaries typically consist of more than three sentences, we also include LEAD-K, with k being equal to the average lay summary length for each dataset (Table 1). Additionally, we include the scores obtained by the technical abstracts (ABSTRACT) and ORACLE_EXT, a greedy extractive oracle (Nallapati et al., 2017) that provides an upper bound for the expected performance of extractive models.
We benchmark four unsupervised extractive approaches: LSA (Steinberger and Jezek, 2004), LEXRANK (Erkan and Radev, 2004), TEXTRANK (Mihalcea and Tarau, 2004), and HIPORANK (Dong et al., 2021). For supervised models, we use the transformer-based BART base model (Lewis et al., 2020), which we fine-tune on our own datasets. To assess how the use of additional data in various forms can benefit performance, we include several other variants of this model which are described in the remainder of this subsection.
Additional training As our datasets remain smaller than those used in other forms of summarisation, we experiment with BART_PubMed, which is first trained on the PubMed abstract generation dataset (Cohan et al., 2018) and then fine-tuned on our own datasets. Aiming to assess how well models trained on PLOS can generalise to eLife and vice versa, we also include BART_Cross, which is initially trained on the opposite dataset to that which it is eventually fine-tuned and evaluated on.
Scaffolding We also experiment with artificially enlarging our training data by way of a scaffold task. Inspired by CATTS (Cachola et al., 2020), we remove the article's abstract from the input text and train the model to generate the abstract as a scaffold task to lay summarisation. In addition to showing whether training for abstract generation can benefit lay summarisation, results for this model will provide an indication of the baseline BART model's reliance on the abstract content. Specifically, we include two copies of every article within our training data: one using the abstract as the reference summary and the other using the lay summary. We distinguish between the two by prepending the input document with the control tokens ⟨|ABSTRACT|⟩ or ⟨|SUMMARY|⟩. Documents within the validation and test splits are prepended with the ⟨|SUMMARY|⟩ token to induce lay summary generation. This model is denoted by BART_Scaffold.
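The construction of the scaffold training data can be sketched as below. The dictionary field names ("body", "abstract", "lay_summary") and the ASCII rendering of the control tokens are assumptions made for illustration; "body" is taken to be the article text with the abstract already removed.

```python
ABSTRACT_TOKEN = "<|ABSTRACT|>"  # ASCII stand-in for the abstract control token
SUMMARY_TOKEN = "<|SUMMARY|>"    # ASCII stand-in for the lay summary control token


def build_scaffold_training_examples(articles):
    """Create two training examples per article, one for each target type."""
    examples = []
    for article in articles:
        body = article["body"]  # assumed to exclude the abstract
        examples.append(
            {"source": f"{ABSTRACT_TOKEN} {body}", "target": article["abstract"]}
        )
        examples.append(
            {"source": f"{SUMMARY_TOKEN} {body}", "target": article["lay_summary"]}
        )
    return examples


def build_evaluation_examples(articles):
    """Validation and test inputs always request a lay summary."""
    return [
        {"source": f"{SUMMARY_TOKEN} {a['body']}", "target": a["lay_summary"]}
        for a in articles
    ]
```

In practice the two control tokens would also need to be registered with the tokenizer so that each is kept as a single unit; that step is omitted here.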

Discussion
Table 4 presents the performance of the aforementioned approaches on the PLOS and eLife test splits using automatic metrics. In line with common practice for summarisation, we report the F1-scores of ROUGE-1, 2, and L (Lin, 2004). Additionally, we include FKGL and DCRS scores of the generated output (see §4.1), providing an assessment of the syntactic and lexical complexity, respectively.
The importance of the abstract Based on the ROUGE scores obtained by the ABSTRACT baseline, we can safely assume that the lay summaries of PLOS are much closer in resemblance to their respective abstracts than those of eLife. The importance of the abstract for lay summary generation is further highlighted by the ROUGE scores of the BART_Scaffold model, which performs notably worse than the standard BART model on PLOS and slightly worse on eLife. These results suggest that having the abstract included within the model input provides significantly more benefit than using abstract generation as an auxiliary training signal.
Extractive vs abstractive In general, we would expect abstractive methods to have greater application for the task of lay summarisation due to their ability to transform (and thus, simplify) an input text. However, abstractive approaches have a tendency to generate hallucinations, resulting in factual inconsistencies between the source and output that damage their usability (Maynez et al., 2020). Therefore, extractive approaches may still have utility for the task, especially if the comprehensibility of selected sentences is directly considered.
For ROUGE scores, we find that extractive baselines (i.e., all unsupervised and heuristic approaches) perform significantly better on PLOS than on eLife, aligning with our previous analysis (§4.4) which identified PLOS as the less abstractive dataset. Interestingly, readability scores achieved by extractive models on PLOS match and sometimes exceed those of abstractive BART models, although they are inconsistent. For eLife, abstractive methods (i.e., BART models) generally obtain superior scores for both ROUGE and readability metrics. In fact, the ROUGE scores achieved by BART exceed those obtained by ORACLE_EXT, further indicating that abstractive methods have greater potential for this dataset.
Use of additional data As previously mentioned, artificially creating more data via an abstract-generation scaffold task results in a decrease in ROUGE scores for both datasets, indicating a reliance on the abstract content for lay summarisation. We also find that pretraining BART on PubMed (BART_PubMed) does very little to affect the performance, suggesting that habits learned for abstract generation do not transfer well to lay summarisation. Similarly, BART_Cross achieves a performance close to that of the standard BART model. Overall, these results indicate that additional out-of-domain training does not provide much benefit for lay summarisation, and alternative modelling approaches that make better use of available data may be a more promising route for future work.

Dataset  Comp.  Layness  Factuality
PLOS     3.7    3.0      3.0
eLife    3.1    3.0      3.0
Table 5: Average ratings from the manual evaluation (Comp. = Comprehensiveness).

Human evaluation
To further assess the usability of the generated abstractive summaries, we perform an additional human evaluation of our standard BART baseline model using two domain experts, both of whom have experience in scientific research and hold at least a bachelor's degree in Biomedical Science. Our evaluation uses a random sample of 10 articles from the test split of each dataset. Alongside each model-generated summary, judges are presented with both the abstract and reference lay summary of the given article. Using a 1-5 Likert scale, the annotators are asked to rate the model output based on three criteria: (1) Comprehensiveness: to what extent does the model output contain all information that might be necessary for a non-expert to understand the high-level topic of the article and the significance of the research; (2) Layness: to what extent is the content of the model output comprehensible (or readable) to a non-expert, in terms of both structure and language; (3) Factuality: to what extent is the model output factually consistent with the two other provided summaries. We choose not to provide judges with the full article text in an effort to minimise the complexity of the evaluation and the cognitive burden placed upon them.
Table 5 presents the average ratings from our manual evaluation. We calculate Krippendorff's α to measure inter-rater reliability, where we obtain values of 0.78 and 0.54 for PLOS and eLife, respectively. In addition to providing ratings, evaluators also provided comments on the general performance on each criterion for both datasets, providing further insights into model performance.
Comprehensiveness We can see from Table 5 that model outputs on PLOS are judged to be more comprehensive than on eLife. From evaluators' comments, we understand that this largely results from extensive use of abstract content for PLOS, which is sometimes copied directly (or with minor edits) to the lay summary. For eLife, it was observed that new information (i.e., not contained in the reference abstract or lay summary) was often introduced which was irrelevant or confusing, potentially affecting the understanding of a lay reader.
Layness Interestingly, given the previously highlighted differences in readability, the average layness of the model output is judged to be equal for both datasets (3.0), suggesting a reasonable degree of content simplification. However, evaluators' comments indicate that model outputs for each dataset were penalised for different reasons. For PLOS, the aforementioned use of abstract content often resulted in the inclusion of jargon terms that a lay reader would struggle to interpret. Alternatively, the language of eLife outputs was observed to be better suited to a lay audience but was sometimes simplified to a point that it could be misconstrued and mislead a reader, occasionally containing grammatical errors, typos, and repeated content.
Factuality Again, we find an equal average score of 3.0 given for factuality, suggesting the model struggles to produce factually correct outputs for both datasets. In fact, we found no output from either dataset was given a perfect score by both annotators, indicating that simplifying technical content accurately is a consistent problem. Evaluators' comments highlight contradictions, unclear phrasing, and misrepresentation of entities as key contributing factors to factual inconsistencies. We regard this as an integral obstacle to overcome in the development of usable lay summarisation models and an essential focus for future research.

Conclusion
In this work, we have introduced PLOS and eLife, two new datasets for the lay summarisation of biomedical research articles. Compared to currently available resources, these datasets possess source article and summary formats that are more broadly applicable to wider literature, with PLOS also being larger by a significant margin. A thorough analysis of our lay summaries highlights key differences between datasets, enabling them to cater to the needs of different audiences and applications. Specifically, in addition to being approximately twice as long as those of PLOS, we find eLife summaries to be both more readable and abstractive, thus better suited to a less technical audience. To facilitate future research, we benchmark our datasets with popular summarisation models using automatic metrics and conduct an expert-based human evaluation, providing further insight into the intricacies of model performance on our datasets and highlighting key challenges for the task of lay summarisation.

Limitations
Although we introduce the largest dataset available to the task of lay summarisation, our datasets remain smaller than those available for other forms of summarisation (e.g., abstract generation), where there exist datasets containing 100,000+ articles. This is largely due to the fact that lay summaries are less ubiquitous than other forms of summary (e.g., the abstract), only being used in a relatively small number of journals, of which only some are open-access and available to be utilised for such purposes as dataset creation.
On a related note, another potential limitation of our datasets is the fact they only cover a single broad domain: biomedicine. Again, this comes down to the availability of data, and the fact that the use of lay summaries is much less common in other scientific domains (e.g., Computer Science). There is, however, a reason for this disparity in the adoption of lay summaries between domains, as it is generally considered more important that the public have an awareness and understanding of research breakthroughs in health-related areas such as biomedicine. Therefore, we believe it is in these domains that automatic lay summarisation can provide the greatest benefit, although we also hope to address the lay summarisation of other domains in future work.

A Appendix
Availability of data sources Both PLOS and eLife are open-access journals, with an emphasis on making scientific research accessible to a wide audience. PLOS articles are available to be mined, reused, and shared by anyone, as per their data mining policy. eLife articles are available under the permissive CC-BY 4.0 license, and thus also available to be retrieved and shared for these purposes. The data for PLOS and eLife was retrieved on 7/03/22 and 11/03/22, respectively. All articles and summaries are in English only. Our datasets are made available to the community to facilitate future research.
Additional data processing details Here we provide some additional details regarding the dataset creation process, building on the description given in §3. For eLife, the XML files retrieved were found to include multiple versions of the same article, identifiable by the article id which includes a version number. For these, we removed duplicates and kept only the most recent versions.
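The version de-duplication step might look like the sketch below. The identifier pattern (a base id followed by "-v" and a version number, e.g. "elife-00003-v2") is an assumption about how the version is encoded, and the function name is hypothetical.

```python
import re

# Assumed identifier layout: "<base>-v<version>"
VERSIONED_ID = re.compile(r"^(?P<base>.+)-v(?P<version>\d+)$")


def keep_latest_versions(article_ids):
    """Return one identifier per article, keeping the highest version number."""
    latest = {}
    for article_id in article_ids:
        match = VERSIONED_ID.match(article_id)
        if match is None:
            # Identifiers without a version suffix are treated as version 0.
            base, version = article_id, 0
        else:
            base, version = match.group("base"), int(match.group("version"))
        if base not in latest or version > latest[base][0]:
            latest[base] = (version, article_id)
    return sorted(full_id for _, full_id in latest.values())
```

Versions are compared numerically, so "v10" is correctly ranked above "v9".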
For both datasets, prior to extraction, we remove all Tables, Figures, and sections marked with the tag "supplementary-material". We also do not extract sections with the heading "acknowledgments". During sentence segmentation, we use a regular expression to identify and temporarily replace all "et al." occurrences with unique placeholder tokens, which are then restored following segmentation. Following segmentation, we again use a regular expression to identify and remove all sentences which begin with "DOI:" followed by a URL from both abstracts and lay summaries, as these were found to commonly occur at the end of both.
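The two regular-expression steps can be sketched as follows. The placeholder format, the exact patterns, and the function names are assumptions, and `naive_segment` is only a simple stand-in for the PySBD segmenter so that the example stays self-contained.

```python
import re

ET_AL = re.compile(r"\bet al\.")
DOI_SENTENCE = re.compile(r"^DOI:\s*https?://\S+")


def protect_et_al(text):
    """Swap each 'et al.' for a unique placeholder; return the text and mapping."""
    mapping = {}

    def substitute(match):
        token = f"__ETAL{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    return ET_AL.sub(substitute, text), mapping


def restore_et_al(sentences, mapping):
    """Put the original 'et al.' strings back after segmentation."""
    restored = []
    for sentence in sentences:
        for token, original in mapping.items():
            sentence = sentence.replace(token, original)
        restored.append(sentence)
    return restored


def naive_segment(text):
    """Stand-in segmenter: split after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def drop_doi_sentences(sentences):
    """Remove sentences that begin with 'DOI:' followed by a URL."""
    return [s for s in sentences if not DOI_SENTENCE.match(s)]
```

Without the placeholder step, a segmenter that splits on full stops would break a citation such as "Guo et al. introduce CDSR." into two fragments.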
Comparison to previous datasets We provide a comparison of the readability and abstractiveness of lay summaries for all lay summarisation datasets in Table 6 and Figure 5, respectively.

Supplementary figure details Table 7 gives the exact percentages that are visualised in Figure 3, allowing for a more detailed analysis of content word sharing (e.g., calculating the ratio of 'shared' to 'not shared' for different content word types).
Baseline model details Here we provide additional experimental details for our baseline approaches (Table 4). ORACLE_EXT is a greedy oracle, which means it repeatedly extracts the next article sentence that will maximise the mean ROUGE scores (1, 2, and L) of the extracted summary, up to the maximum length (equal to the average lay summary length for a given dataset; see Table 1).
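A simplified sketch of such a greedy oracle is given below. To keep it self-contained, a unigram-overlap F1 score stands in for the mean of ROUGE-1, 2, and L, and selection stops early once no remaining sentence improves the score; both of these choices, along with the function names, are assumptions rather than a description of the exact procedure used for Table 4.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, used here as a stand-in for mean ROUGE."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(article_sentences, reference, max_sentences):
    """Greedily add the sentence that most improves the score of the extract."""
    reference_tokens = reference.lower().split()
    selected = []
    best_score = 0.0
    while len(selected) < max_sentences:
        best_index = None
        for index in range(len(article_sentences)):
            if index in selected:
                continue
            candidate = sorted(selected + [index])  # keep article order
            tokens = " ".join(article_sentences[i] for i in candidate).lower().split()
            score = unigram_f1(tokens, reference_tokens)
            if score > best_score:
                best_score, best_index = score, index
        if best_index is None:  # no remaining sentence improves the score
            break
        selected.append(best_index)
    return [article_sentences[i] for i in sorted(selected)]
```

Here `max_sentences` plays the role of the maximum length, which would be set to the average lay summary length of the dataset in question.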
For all BART models, we make use of the Hugging Face transformers library (Wolf et al., 2019). Specifically, we use the "facebook/bart-base" model for baselines BART, BART_Cross, and BART_Scaffold, and we use the "mse30/bart-base-finetuned-pubmed" model for BART_PubMed. Training was run (using 4x NVIDIA Tesla V100 SXM2 GPUs) for all models with AdamW optimisation (Loshchilov and Hutter, 2019) and an early stopping patience of 25 epochs, with the best model being selected by performance on the validation set (ROUGE-2).
All unsupervised baselines were run with default configurations.
Automatic evaluation For the calculation of ROUGE scores, we use the rouge-score Python package. For FKGL and DCRS, we use the textstat Python package.
Human evaluation comments Comments on the general model performance for each criterion provided by each annotator for our human evaluation are given in Figures 6 and 7 for PLOS and eLife, respectively.

Lay summary examples
Full examples of lay summaries and their respective technical abstracts are given in Figures 8 and 9 for PLOS and eLife, respectively.

Comprehensiveness
Annotator 1: The model outputs summarised the important information, however it also cut out quite a lot of background info which is key for understanding the science. Overall, the comprehensiveness was enough for a non-expert to grasp the overall gist of the studies.
Annotator 2: The model seems to use parts of the abstract, and therefore seems quite comprehensive. It also does a decent job of a final "summary" sentence to the lay summary to summarize/put into context.

Layness
Annotator 1: Some abstracts contained a lot of jargon which would be confusing/off-putting to a non-expert. Although I know some scientific words cannot be substituted, it would be good to have an explanation of the more complex words in brackets, for example.
Annotator 2: Due to the overlap with the reference abstract, the output is comprehensive but probably confusing to a lay audience; in some cases there is no introduction/background on scientific jargon (e.g. we cannot expect a lay audience to understand complex scientific techniques, cellular or molecular machinery). Use of Genus species nomenclature is also likely to confuse lay audiences, where a common name could be used instead, not as well as (e.g. 'Egyptian mosquito' instead of 'Aedes aegypti, also known as the Egyptian mosquito').

Factuality
Annotator 1: The majority of the statements were factually correct, although sometimes the meaning of the simplified language could be misinterpreted, which would result in a similar outcome to factually wrong statements.
Annotator 2: Some minor factual errors throughout, and mixups between gene symbols (e.g. where one letter will be changed, PMN to PMA). There are also some cases where it will pull out a % but mix up what gene/condition it is related to, ultimately leading to the formation of a sentence which is factually incorrect.

Comprehensiveness
Annotator 1: Overall the information contained within the model-generated summaries effectively conveyed the information in the references. However, there were a few occasions where new elements/concepts were introduced that could confuse the reader and affect their understanding (these were sometimes factual statements, sometimes seemingly made up). The model abstract did provide enough information for a general understanding of the topic and would be sufficient as a brief overview.
Annotator 2: The model usually picks up on the core points of the abstract but can often introduce extra information which is either off-topic or factually incorrect. The model seems to start off by introducing the topic well but struggles to hit the "what is the significance of this research?" question.

Layness
Annotator 1: The language was, in the majority, well suited to a lay person and terminology was adapted accordingly. However, at times the information was simplified to a point where it could be misconstrued which, with scientific information, is a potential risk. At times, jargon still remained and I could imagine some people being confused by this. There were a few grammatical errors and poor sentence structure, typos, repetition etc.
Annotator 2: Sometimes the model introduces extra information which is not suitable for a lay audience, for example: references to genome sequencing, progenitor cells, endoplasmic reticulum. There are also instances where misinterpretation by the model may mislead the lay audience; for example, there was an output where Norepinephrine was said to be "a.k.a. dopamine", which is not factually correct.

Factuality
Annotator 1: There were a few summaries which contained incorrect information; things that are well-known in the scientific community were poorly conveyed. At times, new information was introduced which contradicted earlier statements, and those of the reference abstracts/lay summaries. Of course, some information was correct. I would be concerned about the level of misinformation which could arise from these summaries, if used to educate a lay audience.
Annotator 2: This seemed to vary based on the abstract and how well the output started; for example, if the model introduced the topic well, it would lead to more factual points. However, there were some generated summaries which were factually incorrect from the start and this led to more errors.

Technical Abstract
Adult stem cells are responsible for life-long tissue maintenance. They reside in and interact with specialized tissue microenvironments (niches). Using murine hair follicle as a model, we show that when junctional perturbations in the niche disrupt barrier function, adjacent stem cells dramatically change their transcriptome independent of bacterial invasion and become capable of directly signaling to and recruiting immune cells. Additionally, these stem cells elevate cell cycle transcripts which reduce their quiescence threshold, enabling them to selectively proliferate within this microenvironment of immune distress cues. However, rather than mobilizing to fuel new tissue regeneration, these ectopically proliferative stem cells remain within their niche to contain the breach. Together, our findings expose a potential communication relay system that operates from the niche to the stem cells to the immune system and back. The repurposing of proliferation by these stem cells patch the breached barrier, stoke the immune response and restore niche integrity.

Lay Summary
Most, if not all, tissues of an adult animal contain stem cells. These stem cells regenerate and repair damaged tissues and organs for the entire lifetime of an animal, contributing to a healthy life. They divide to make daughter cells that become either new stem cells or specialized cells of that organ. Adult stem cells exist in specific areas within tissues known as niches, where they interact with surrounding cells and molecules that inform their behavior. For example, cells and molecules within these niches can signal stem cells to remain in a 'dormant' state, but upon injury, they can mobilize stem cells to form new tissue and repair the wound. So far, it has been unclear how stem cells sense damage and stress and direct their efforts away from their normal duties towards repair. Here, Lay et al. studied the stem cells in the mouse skin that are responsible to regenerate hair. Every hair follicle contains a niche (the 'bulge'), where these stem cells live and share their environment with cells that anchor the hair. The niche tethers to the stem cells through specific adhesion molecules that also help the niche to form a tight seal to prevent bacteria from entering. Lay et al. removed one of the adhesion molecules called E-cadherin, which caused a breach in the niche's barrier. The stem cells sensed their damaged niche, prepared to multiply, and sent out stress signals to the immune system. The immune cells then arrived at the niche and sent signals back to the stem cells, prodding them to multiply and patch the barrier, while at the same time, keeping the inflammation in check. This remarkable ability of the stem cells to recruit immune cells and initiate a dialogue with them enabled the stem cells to divert their attention from regenerating hair and instead directing it towards the site of the tissue damage. Other stem cells, such as those in the lung or gut, may have similar mechanisms to detect and respond to physical damage. It will be interesting to investigate the underlying mechanism of how immune cells are involved in balancing stem cell regenerative capacity and response to physical damage. A better knowledge of these processes could help to regenerate tissues or even entire organs.

Figure 2: Barplot visualising the rhetorical class distributions in our abstracts and lay summaries.

Figure 3: Stacked barplot showing how regularly (on average) abstract content words are shared with the respective lay summaries (as a % of all words of that type), separated by number of abstract occurrences.

Figure 4: Barplot showing the percentage of novel n-grams for each summary type.

Figure 6: Human evaluation comments for PLOS.

Table 3: Mean percentage of each rhetorical label within our abstracts and lay summaries.

Table 4: Performance of summarisation models on the test splits of each dataset (R = average ROUGE F1-score). The best non-heuristic scores for each metric are given in bold.