BioLaySumm 2023 Shared Task: Lay Summarisation of Biomedical Research Articles

This paper presents the results of the shared task on Lay Summarisation of Biomedical Research Articles (BioLaySumm), hosted at the BioNLP Workshop at ACL 2023. The goal of this shared task is to develop abstractive summarisation models capable of generating “lay summaries” (i.e., summaries that are comprehensible to non-technical audiences) in both a controllable and non-controllable setting.There are two subtasks: 1) Lay Summarisation, where the goal is for participants to build models for lay summary generation only, given the full article text and the corresponding abstract as input; and2) Readability-controlled Summarisation, where the goal is for participants to train models to generate both the technical abstract and the lay summary, given an article’s main text as input.In addition to overall results, we report on the setup and insights from the BioLaySumm shared task, which attracted a total of 20 participating teams across both subtasks.


Introduction
Biomedical publications report upon the latest research concerning prominent health-related topics, ranging from common illnesses to global pandemics (Wang et al., 2020).Accordingly, the content of these publications is of interest to a wide variety of audiences, including researchers, medical professionals, journalists, and even members of the public.However, the highly technical and specialist language used within such articles typically makes it difficult for non-expert audiences to understand their contents.This results in useful knowledge and findings having limited accessibility to the general public (Guo et al., 2021;Goldsack et al., 2022;Luo et al., 2022b).
Abstractive summarisation models can be used to generate a concise summary of an article, capturing its most salient points using words and sentences that do not necessarily appear in the original text of the article.As such, these models have the potential to make highly technical documents accessible to a much wider audience through the generation of "lay summaries" -more readable summaries consisting largely of background information and containing minimal technical terminology (Guo et al., 2021;Goldsack et al., 2022;Luo et al., 2022b).
The BioLaySumm shared task1 focuses on the abstractive summarisation of biomedical articles whilst placing an emphasis on controllability and ensuring comprehensibility for non-expert audiences.Through this shared task, we aim to foster increased research interest in Lay Summarisation (in both controllable and non-controllable settings), enabling further progression for novel model development and high-quality dataset construction.In turn, we hope this will help to broaden the accessibility of technical texts to non-specialist audiences and to drive progress towards more usable and effective abstractive summarisation models for the biomedical domain with the ability to cater to audiences possessing different levels of expertise.
In this paper, we present the results of the first BioLaySumm shared task, hosted by the BioNLP Workshop at ACL 2023.We cover the task formulation ( §2), datasets ( §3), and evaluation procedure ( §4), before providing a description of the participating systems, overall results, and notable insights ( §5).

Task Description
The shared task is composed of two separate subtasks, focusing on 1) the generation of summaries more suitable for a lay audience (Lay Summarisation), and 2) the development of controllable summarisation models capable of catering to audiences with different levels of expertise (Readabilitycontrolled Summarisation).

Subtask 1: Lay Summarisation
Given an article's abstract and main text as input, the goal is for participants to train a model (or models) to generate the lay summary.Two separate datasets, PLOS and eLife (derived from the eponymous biomedical journals), were provided for model training and will be used for evaluation (more details on datasets are given in §3).For the evaluation, we average submission performance across both datasets.
For this task, we allowed submissions to be generated from either two separate summarisation models (i.e., one trained on each dataset) or a single unified model (i.e., trained on both datasets).Participants were required to indicate which approach was taken for each submission, in addition to whether or not they made use of additional training data (i.e., data not provided specifically for the task).

Subtask 2: Readability-controlled Summarisation
Given the main text of an article as input, the goal is for participants to train a model (or models) to generate both the technical abstract and the lay summary.A single dataset, PLOS, is provided for training and evaluation.We allowed submissions to use multiple ensemble models but still generate technical summary and the lay summary from the same model, and also one single main model with different output layers to generate two different summary types.As with subtask 1, participants are required to indicate whether or not they made use of additional training data for each submission.For the evaluation, we average submission performance across both summary types.

Datasets
The datasets used within each subtask are based on the previous works of Goldsack et al. (2022) and Luo et al. (2022b), and are derived from two different biomedical publications: Public Library of Science (PLOS) and eLife.Each dataset consists of research articles, their technical abstracts, and their expert-written lay summaries.As detailed in §2, each form of summary within these datasets (i.e., abstract and lay summary) has a different utility in each subtask.The lay summaries of each dataset also exhibit numerous notable differences in their characteristics, with eLife's lay summaries being longer, more abstractive, and more readable than those of PLOS.Furthermore, for PLOS, lay summaries are author-written, and articles are derived from 5 peer-reviewed journals covering Biology, Computational Biology, Genetics, Pathogens, and Neglected Tropical Diseases.For eLife, lay summaries are written by expert editors (in correspondence with authors), and articles are derived from the peer-reviewed eLife journal that covers all areas of the life sciences and medicine.For more detailed analysis of dataset content, please refer to Goldsack et al. (2022).
Table 1 summarises the data split information for both datasets.Note that the training and validation sets used for both datasets are equal to those published in Goldsack et al. (2022).Furthermore, that the training and validation splits of PLOS are the same for both subtasks.
Alternatively, we collect new test splits for both PLOS and eLife data using more recently published articles from each respective journal.The test data for Subtask 1 is composed of 142 PLOS articles and 142 eLife articles.The test data for Subtask 2 is composed of 142 PLOS articles (however, these are different from those used in Subtask 1).
In utilising these datasets for our task, we hope to enable the training of abstractive summarisation models that are capable of generating lay summaries for unseen articles covering a wide range of biomedical topics, enabling the significance of new, important publications to be effectively communicated to non-expert audiences.

Evaluation
For both subtasks, we evaluate summary quality according to three criteria -Relevance, Readability, and Factuality -where each criterion is composed of one or more automatic metrics: • Relevance: ROUGE-1, 2, and L (Lin, 2004) and BERTScore (Zhang et al., 2020b).
• Factuality: BARTScore (Yuan et al., 2021), fine-tuned on our respective datasets (as has proven effective in recent work (Koh et al., 2022)).2 For Subtask 1, the scores calculated for each metric are the average of those calculated independently for the generated lay summaries of PLOS and eLife.The aim is to maximise the scores for Relevance and Factuality metrics and minimise scores for Readability metrics.
For Subtask 2, the scores presented for each metric are the average of those calculated independently for the generated abstracts and lay summaries.Notably, for Readability metrics in this subtask, we calculate the absolute difference between the scores of generated summary and target summary pairs (rather than simply using the scores obtained for generated summaries, as in subtask 1).The aim is to maximise the scores for Relevance and Factuality metrics and minimise the absolute difference scores calculated for Readability metrics.
Following the submission deadline for each subtask, an overall ranking is calculated based on the cumulative rank of the evaluation criteria, where a lower overall ranking equates to better overall performance).To produce a criterion ranking, we apply min-max normalisation to the scores of each metric, before averaging across metrics within each evaluation criterion.

Shared Task Submissions
For both subtasks, we include a baseline system based on BART-base (Lewis et al., 2020) in order to provide a simple, widely-used benchmark with which submission performance can be compared.For subtask 1, this baseline system is composed of two separate BART models, trained independently on the PLOS and eLife datasets.For subtask 2, the baseline system is a controllable BART model, trained to generate either the abstract or lay summary of an article based on the inclusion of control tokens prepended to the input document ([ABSTRACT] and [SUMMARY], respectively).Participating teams for each subtasks were allowed to make a maximum of 3 submissions in total.

Systems Overview
Subtask 1 attracted a total of 20 participating teams, between them making a total of 49 submissions.A brief explanation of the modelling approach taken by each team is given below:3 LHS712EE (Liu et al., 2023) The team employed a BART model for eLife and a Longformer Encoder-Decoder (LED) model (Beltagy et al., 2020) for PLOS, whilst also experimenting with optimising memory usage.
GRASUM (Rosati, 2023) Standing for Grounded, Referenced, and Annotated SUMmaries, this team's method combines approaches from retrieval augmentation, offline RL, and controlled generation, using a LED model with 16k input limit as the base model.A "grounding" step enhances each document with content retrieved from scientific abstracts, Wikipedia, simple Wikipedia, and UMLS that is appended to input, in addition the bibliographic reference string of the source document (obtained from CrossRef).An additional "annotation" step annotates each source document with control tokens that indicate whether the corresponding summary achieves higher or lower than the median score for each of the task evaluation aspects (given in §4).
BITSpi This team's method involves fine-tuning two separate BART models on pre-processed versions of each dataset.Specifically, stopwords are removed from the input data, and abbreviations are substituted for their full forms using a medical dictionary.
APTSum (Poornash et al., 2023) A three-step approach is adopted by this team, leveraging the SimCLS contrasting learning framework (Liu and Liu, 2021).Specifically, they first perform content selection, identifying the Abstract and Introduction as best model input, before generating candidate summaries using BART, followed section-wise reranking using a RoBERTa-base model to capture section-based salience information.
LaSTUS-FBK This team used a multi-stage unified approach, first cleaning the data via reference  VBD-NLP (Phan et al., 2023) This team's method is based on the combined use of sequenceto-sequence model BioBART (Yuan et al., 2022) and FACTORSUM (Fonseca et al., 2022), a factorized energy-based model that aims to identify the most important input content, enabling more effective processing of long documents.Additional experimentation with handling length as well as utilising other Pretrained Language Models (PLMs) was also carried out.

Pathology Dynamics (Al-Hussaini et al., 2023)
The team experimented with multiple different approaches based on BART and T5 models including methods of content selection, the use of efficient attention mechanisms (to better process long documents), and the zero-shot simplification of model outputs.Of those tested, the approach that achieved the best overall performance was BART-large, pretrained on CNN-DM dataset, with inputs truncated to 1024 tokens.

Pos. Team
Relevance  Arizona Sky This team first truncate input documents, before using them to train two separate BART base models.
IKM_Lab (Wu et al., 2023) This team experimented with the use of a LED model trained on both datasets, as well as the adoption of different formats for including additional article information, such as keywords and section headings, in the input.
NCUEE-NLP (Chen et al., 2023) This team also made use of different models for each submission, including Primera (Xiao et al., 2022), a PEGASUS model (Zhang et al., 2020a) pretrained on PubMed, and a BART-large Longformer model.
himil The team experimented with both BERT (Devlin et al., 2019) and Longformer-based models, trained individually on each dataset.

Results
Table 2 presents the performance of the submission selected to appear on the leaderboard by each team according to the defined task metrics and Table 3 presents the rankings of these submissions (both overall and according to each individual criteria) following the application of the evaluation process described in §4.
In general, we find that more teams opted for the use of two models (10 out of 20), one for each of the two provided datasets, rather than a single unified model trained on both datasets (6 out of 20).Furthermore, the use of additional training data (i.e., data not provided as part of this task) to directly fine-tune models was relatively rare, with only 3 confirmed instances.However, all participants decided to make use of pre-trained language models (PLMs) in their submissions.In terms of the specific models used, we find BART-based models (e.g., BART, Longformer Encoder-Decoder, etc.) to be a particularly popular choice amongst teams, being utilised by 11 out of 13 teams who provided detailed descriptions of their method.Finally, we observe that several teams also chose to experiment with data preprocessing, implementing methods such as data cleaning, data annotation, and data augmentation with varying degrees of success.
We find that the best overall system (i.e., that which achieved the lowest summed ranking across the three evaluation criteria) is that of team MDC, whose best submission utilises a single ChatGPTbased model (text-davinci-003) coupled with fewshot prompting to generate the lay summaries of both datasets, based on only the abstracts.Although this system does not achieve the best performance in any individual criteria, it achieves a strong performance for both Relevance and Factuality (ranking 3rd for both) whilst maintaining an above-average Readability ranking (10th).The fact that the cumulative rank of this system is equal to 16 is evidence that no model is able to achieve universally strong performance across all criteria (relative to other submissions).However, the fact that the top-ranking submission is based on only few-shot in-context learning (i.e., without any finetuning on the provided training data) suggests that Large Language Models have the potential to offer significant benefits for Lay Summarisation.
Interestingly, this is the only submission to achieve a better cumulative rank than that of the BART baseline system (18th), which is shown to rank first for Factuality, and above average for the other two criteria (9th and 8th for Relevance and Readability, respectively).We originally suspected that a possible explanation for the baseline system's strong performance in terms of Factuality is a potential bias of BARTScore towards BARTbased models.However, the leaderboard results do not seem to support this, with BART-based models being widely used and achieving a wide range of scores.
Two teams tied for third in terms of overall ranking, with both Marsfield_SDS and VBD-NLP achieving a cumulative rank of 19.Each of these teams adopted innovative and diverse strategies with their submissions.Marsfield_SDS focused largely on data augmentation including the use of ChatGPT for generating lay summary paraphrases, resulting in particularly strong performance in terms of Relevance (2nd).Alternatively, VBD-NLP experimented with the use of the factorised energy-based model FACTORSUM, achieving a good all-rough performance across all criteria.Finally, the 5th placed submission of team LHS712EE is also worthy of note, obtaining the best rank for Relevance and 2nd best for Factuality.

Systems Overview
Three teams have made in total 7 attempts for Subtask 2. A brief description of their respecitve approaches are as following: LHS712EE (Liu et al., 2023) The team carried on with the LED (Beltagy et al., 2020) model trained on the PLOS dataset from Subtask 1 to test the generalizability of their approach in generating lay summaries coupled with a pre-trained LED model for abstractive summaries.They later retrained the model using the abstract section of the dataset to improve performance in generating technical abstracts.
Pathology Dynamics (Al-Hussaini et al., 2023) As the abstract with the most salient information is no longer present in the input, to tackle the long context input, the team trained a base LSG model (Condevaux and Harispe, 2022) and truncated each article to the first 4096 tokens for generating both abstracts and lay summaries.The model was then trained on a merged dataset that uses each article twice, with one output having the lay summaries and the other having the abstract.They also reported using simplification procedures such as MUSS (Martin et al., 2022) to enhance the lay summary or other instruction-following models such as T5 with different prefix for summarisation.(Chen et al., 2023) This team made use of different models for each submission, including Primera, a PEGASUS model pre-trained on PubMed, and a BART-large Longformer model.

Results
In Table 4, the performance of the submissions to Subtask 2 is shown on the leaderboard by each team according to the defined task metrics.Table 5 presents the overall and by individual metric rankings of these submissions following the application of the evaluation process described in §4.Due to the overall ranking scheme and the limited number of participants, we have all three teams ranked first while demonstrating advantages and disadvantages in different aspects.
All three teams utilise augmented transformers that can take longer input context, which significantly boosts the performance of Rouge score while also achieving smaller readability differences.We assume this is because longer input enables the models to see more lexicons that can be used to build summaries, resulting in a better chance to overlap with the reference summaries.However, these improvements do not necessarily promise higher results on LM-based metrics such as BERTScore and BARTScore on which the baseline method prevails.
It is worth noting that Team Pathology Dynamics used summaries generated from a LSG model simultaneously trained on both plain language as well as technical references and get output as a hybrid of the lay summaries and abstracts.Their methods obtains the highest readability and joint overall highest scores, suggesting the limitation of the readability metrics used for evaluation.In addition, they reported that neither simplification model nor small-scale instruction-following models succeed to improve performance in this task.
In conclusion, none of the participating team secured a sweeping superiority across the three evaluated aspects, highlighting the challenge in readability-controlled summarisation on relatively small-scaled language models.Given that LLMs (Large Language Models) better align with human instructions (Ouyang et al., 2022), we expect future work to investigate their capabilities in the task.

Conclusion
The first BioLaySumm shared task was hosted at the BioNLP Workshop @ ACL2023 and consisted of two subtasks focusing on Lay Summari-sation and Readability-controlled Summarisation, respectively.The task attracted a total of 20 teams, between them making 56 individual submissions across both subtasks.Submissions were evaluated according to three general criteria -Relevance, Readability, and Factuality -with each criteria consisting of one or more automatic metrics.
The results of both subtasks show that achieving strong performance for all three criteria (relative to other submissions) was particularly rare, attesting to the challenging nature of generating lay summaries for research articles in both controlled and non-controlled settings.Furthermore, when also taking into account the relatively strong performance of the BART baseline models (in particular for the Factuality component of our evaluation), this suggests that further research effort is required to develop truly usable models that can be reliably deployed in real-world settings.
However, as demonstrated by highly-ranked teams MDS and Marsfield_SDS (who obtain first and joint third-ranking submissions for subtask 1, respectively), recent developments in the abilities of both general-purpose and in-domain LLMs have the potential to offer significant benefits for the automatic generation lay summaries.As such, we expect that utilising such models for summary generation, data augmentation, and evaluation to be promising future directions for Lay Summarisation.

Table 2 :
Subtask 1 leaderboard -all metrics.The # column denotes the number of models used -1 (unified) or 2 (one for each dataset), and the + column denotes the use of additional training data."-" indicates that the corresponding information was not provided.R = ROUGE F1, BERTs = BERTScore, FKGL = Flesch-Kincaid Grade Level, DCRS = Dale-Chall Readability Score, BARTs = BARTScore.removal and acronym resolution.Extractive summarisation based on similarity-based sentence classification is then used to shorten the input before the resulting text is enhanced with the injection of complex concept definitions from Wikipedia.Finally, abstractive summarisation is performed using a fine-tuned BART model pre-trained on PubMed on a dataset-balanced sample of the training data (4K training instances from each dataset).Marsfield_SDS(Sim et al., 2023) Using two finetuned FLAN-T5 models (one for each dataset) as the backbone of their experiments, this team experimented with different data augmentation strategies including the use of ChatGPT for paraphrasing existing lay summaries.

Table 1 :
Data split sizes for each dataset.* denotes that this split is different for each subtask.