Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding

Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods for medical simplification sometimes produce text of lower quality and diversity. In this work, we explore ways to further improve the readability of simplified text in the medical domain. We propose (1) a new unlikelihood loss that encourages the generation of simpler terms and (2) a reranked beam search decoding method that optimizes for simplicity; together, these achieve better performance on readability metrics across three datasets. Our findings offer promising avenues for improving text simplification in the medical field.


Introduction
In recent years, text simplification has become an increasingly useful application of AI (Stajner, 2021), particularly in healthcare (Carroll et al., 1998; Saggion et al., 2015; Orȃsan et al., 2018), where text can be technical and difficult to understand. By automating this process, we can help healthcare professionals explain key medical texts (e.g., doctor's reports and findings) to patients. Previous work on text simplification in the medical domain has explored pretrained language models (Devaraj et al., 2021; Sun et al., 2023; Martin et al., 2022; Trienes et al., 2022; Basu et al., 2023; Joseph et al., 2023b; Lu et al., 2023), reinforcement learning (Phatak et al., 2022), and zero-shot prompting (August et al., 2022; Joseph et al., 2023b). Despite this progress, simplification sometimes results in generated text of lower quality and diversity (Devaraj et al., 2021; Phatak et al., 2022). Further, we find that some simplification models copy sentences from the source and thus do not sufficiently improve readability (see Appendix B).
In this work, we seek to further improve medical text simplification. We first propose a new unlikelihood loss that penalizes words in proportion to their reading level, using a well-established readability index. Second, we propose a modified beam search method that reranks intermediate candidates at decoding time based on their readability. Despite their simplicity, our methods improve readability on automated metrics (by up to 2.43 points on Flesch-Kincaid) and in human evaluation, while maintaining similar performance in terms of factual consistency and overall simplification quality.
We make the following contributions: (1) we propose a new form of unlikelihood loss, based on a well-established readability index, to improve medical text simplification; (2) we propose a decoding strategy that optimizes for readability in medical text simplification; and (3) we provide evaluation results for the previous state of the art on three datasets in terms of readability and factual consistency. We make our code publicly available at https://github.com/ljyflores/simplification-project.

Related Work
Text simplification research has focused primarily on the sentence level (Xu et al., 2015; Specia and Paetzold, 2017; Sulem et al., 2018; Srikanth and Li, 2020; Shardlow and Alva-Manchego, 2022), with some attempts at paragraph- or document-level datasets (Sun et al., 2021; Laban et al., 2023). Most datasets have been sourced from Wikipedia or news articles, which are already quite accessible. The medical field, however, laden with technical jargon, can greatly benefit from simplification. Initial methods in medical text simplification employed lexical and syntactic techniques (Llanos et al., 2016; Abrahamsson et al., 2014), while recent work includes fine-tuning language models such as BART (Devaraj et al., 2021; Lewis et al., 2020) and a two-stage summarize-then-simplify approach (Lu et al., 2023). Medical simplification has also expanded to multilingual settings (Joseph et al., 2023b).
To improve simplification, we also intervene at the decoding stage. Previous work uses modified decoding methods to address factual inconsistency (Shi et al., 2023; King et al., 2022; Sridhar and Visser, 2022), or to optimize fluency and diversity in text generation (Kriz et al., 2019; Hargreaves et al., 2021). Our work extends this line by optimizing the decoder for readability in medical text simplification.

Methods
We propose two simple but effective approaches for improving medical text simplification, one applied during training and the other during decoding. Specifically, we propose a modified unlikelihood loss (Welleck et al., 2020) that incorporates a readability index and encourages the model to favor the generation of simpler words. We then introduce a decoding approach that evaluates and reranks the candidate beams by considering both readability and factuality. We detail these approaches below.

Unlikelihood Loss for Simplification
Unlikelihood loss (UL) (Welleck et al., 2020) is a training objective that forces unlikely generations to be assigned lower probability by the model (See Figure 1).
Readability UL Following prior work (Devaraj et al., 2021), we use this loss to force the model to assign lower probability to complex words. Unlike Devaraj et al. (2021), we use the Flesch-Kincaid (FK) readability score (Kincaid et al., 1975) instead of model-predicted scores. The Flesch-Kincaid score is a numerical indicator that assesses the complexity of a text by estimating the US grade level needed for comprehension. Because FK considers syllable count and average phrase length, it serves as a good proxy metric even for incomplete sentences, prioritizing text with shorter words and shorter phrases. We incorporate the score as follows. At generation step t, we identify the word v in the vocabulary with the largest output probability; this is the word the model is most likely to output at step t. We compute the token-level UL for v as the product of the word's Flesch-Kincaid score and its standard UL term −log(1 − p(v | ŷ<t)). The total readability UL is the sum of the token-level penalties:

UL_R = − Σ_t Σ_{v∈V} 1_{v,t} · FK_v · log(1 − p(v | ŷ<t)),

where 1_{v,t} indicates whether word v has the largest output probability in the vocabulary at step t, and FK_v is the Flesch-Kincaid score of word v.
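The token-level penalty described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the syllable counter is a crude vowel-group heuristic, the single-word FK score simply applies the published formula to one word, and `step_probs` is a toy distribution standing in for real model outputs.

```python
import math

def count_syllables(word):
    """Crude vowel-group syllable heuristic (a real implementation would
    use a pronunciation dictionary such as CMUdict)."""
    word = word.lower()
    groups, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def fk_word_score(word):
    """Flesch-Kincaid grade applied to a single word:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    return 0.39 * 1 + 11.8 * count_syllables(word) - 15.59

def readability_ul(step_probs):
    """UL_R: at each step, take the argmax word v and add
    FK_v * -log(1 - p(v)); higher-grade words incur larger penalties."""
    total = 0.0
    for probs in step_probs:  # probs: dict mapping word -> probability
        v, p_v = max(probs.items(), key=lambda kv: kv[1])
        total += fk_word_score(v) * -math.log(1.0 - p_v)
    return total
```

Under this formulation, a polysyllabic argmax word such as "exacerbation" contributes a much larger penalty than "attack" at the same probability, nudging training toward the simpler term.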
Consistency UL As we discuss in §4, we find that UL_R alone leads to hallucinations; hence, we also penalize the model for generating unsupported words in a set e using an additional factual consistency UL:

UL_C = − Σ_t Σ_{v∈V} 1_{v,e} · log(1 − p(v | ŷ<t)),

where 1_{v,e} is an indicator for whether the word v is in the set of hallucinated words e.
We determine the set e as follows. We first identify the sequence the model is most likely to generate by finding the tokens with the highest logits at each generation step. We then filter this set to the tokens that appear in neither the input text nor the label. At this point, the set contains all words that the model is likely to generate but that are not present in the input or label; hence, it may contain words that are factually or grammatically correct but simply do not match the gold summary. Since we want to penalize only tokens we are confident are factually incorrect, we further filter the set down to named entities using the spaCy en_core_web_lg NER model (Honnibal and Montani, 2017), which yields the entity set e.
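The construction of e reduces to simple set operations once the NER component is abstracted away. In this sketch, `is_entity` is a caller-supplied predicate standing in for the spaCy NER model, and whitespace tokenization replaces the real subword tokenizer, both assumptions for illustration.

```python
def hallucinated_entities(argmax_tokens, input_text, label_text, is_entity):
    """Build the set e: tokens the model is most likely to generate that
    appear in neither the input nor the label, filtered down to named
    entities only (spaCy NER in the paper; `is_entity` is a stand-in)."""
    seen = set(input_text.split()) | set(label_text.split())
    unsupported = {tok for tok in argmax_tokens if tok not in seen}
    return {tok for tok in unsupported if is_entity(tok)}
```

For example, if the model's argmax sequence mentions a city absent from both the input and the label, that token survives both filters and lands in e, while merely paraphrased non-entity words are left unpenalized.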

Overall Loss
The overall loss is a weighted sum of the negative log-likelihood (L_NLL) and the two unlikelihood terms, L = L_NLL + λ_R · UL_R + λ_C · UL_C, where λ_R and λ_C are constants.

Reranked Beam Search Decoding
Readability Score We optimize candidates' readability during decoding using the Flesch-Kincaid (FK) Grade Level score. FK represents the readability of a text as a US grade level; hence, lower scores are more readable (Kincaid et al., 1975). Scores typically range from 0 to 18, but can extend past this range in practice. We compute the FK score of each candidate beam and clip it to the range [4, 20], as we qualitatively find beams with scores below 4 to be equally simple, and those above 20 to be equally complex. We then normalize the score r_F(s) to the range [0, 1], such that 0 is least readable and 1 is most readable.
Consistency Score As in UL training, we find that optimizing solely for readability during decoding may introduce hallucinations; hence, we balance readability with consistency, as measured by BERTScore (Zhang et al., 2020). We find beams with scores below 0.60 to have equally poor factuality, so we clip the score r_B(s) to the range [0.60, 1.00] and normalize it.
Composite Score We compute a composite score r(s) with an F1-like metric, i.e., the harmonic mean of r_F(s) and r_B(s). Note that this score is used only to rerank the candidates.
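The beam scoring can be sketched directly from the clipping and normalization rules above. Two assumptions are flagged: the harmonic-mean form is our reading of "F1-like", and the clip thresholds (4/20 for FK, 0.60/1.00 for BERTScore) follow the values stated in the text.

```python
def readability_score(fk):
    """r_F: clip the FK grade to [4, 20], then map to [0, 1] with
    1 = most readable (lowest grade level)."""
    fk = min(max(fk, 4.0), 20.0)
    return (20.0 - fk) / (20.0 - 4.0)

def consistency_score(bertscore):
    """r_B: clip BERTScore to [0.60, 1.00], then map to [0, 1]."""
    b = min(max(bertscore, 0.60), 1.00)
    return (b - 0.60) / 0.40

def composite_score(fk, bertscore):
    """F1-like combination (harmonic mean) of r_F and r_B, used only
    to rerank candidate beams during decoding."""
    r_f, r_b = readability_score(fk), consistency_score(bertscore)
    if r_f + r_b == 0:
        return 0.0
    return 2 * r_f * r_b / (r_f + r_b)
```

The harmonic mean rewards beams that are both readable and consistent: a beam with perfect readability but floor-level BERTScore still scores 0, which prevents the decoder from trading factuality for simplicity wholesale.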
Ranking Every k Steps Computing metrics at each generation step is inefficient, and the meaning or readability of a beam might not change after adding a single word. Hence, we reduce the frequency of reranking to intervals of k steps (see Appendix E).
Hallucination Heuristic We implement a heuristic to remove beams containing unsupported entities. We identify entities with the spaCy en_core_web_lg NER model (Honnibal and Montani, 2017), check whether each entity appears in the source, and set a beam's score to zero if any of its entities does not.
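Putting the decoding-time pieces together, the reranking step might look like the following sketch. Here `score_fn` and `extract_entities` are caller-supplied stand-ins for the real composite scorer and the spaCy NER model; the interval check and the zeroing of beams with unsupported entities follow the description above.

```python
def rerank_beams(beams, step, k, source, score_fn, extract_entities):
    """Every k generation steps, rescore and reorder candidate beams.
    Beams containing entities unsupported by the source are zeroed out
    (the hallucination heuristic); remaining beams are sorted by the
    composite readability/consistency score, best first."""
    if step % k != 0:
        return beams  # skip rescoring between intervals
    scored = []
    for beam in beams:
        if any(ent not in source for ent in extract_entities(beam)):
            scored.append((0.0, beam))  # unsupported entity -> score 0
        else:
            scored.append((score_fn(beam), beam))
    scored.sort(key=lambda sb: sb[0], reverse=True)
    return [beam for _, beam in scored]
```

A beam that introduces, say, a place name absent from the source drops to the bottom of the ranking regardless of how readable it is, while between intervals the candidate order is left untouched.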

Experiments
Datasets We run our experiments on three datasets. Cochrane (Devaraj et al., 2021) consists of 4,459 pairs of abstracts from the Cochrane Database of Systematic Reviews and their corresponding summaries written by domain experts. MedEasi (Basu et al., 2023) consists of 1,697 pairs of human-annotated sentences sourced from the Merck Manuals (Cao et al., 2020) and SimpWiki (van den Bercken et al., 2019). Finally, the Radiology Reports Dataset¹ consists of 2,269 radiology reports collected from a large urban hospital and simplified by medical residents and doctors.
Baselines We compare against a BART-XSum model (Lewis et al., 2020) that we further fine-tune on our datasets, as well as the state-of-the-art models of Lu et al. (2023) and Devaraj et al. (2021), all fine-tuned on each of the three datasets. We chose BART-XSum to align with previous work, providing an apples-to-apples comparison that isolates the impact of our methods. We also compare with the state-of-the-art large language model GPT-4-0314 (OpenAI, 2023)².
Evaluation Metrics We evaluate readability, consistency, and overall performance as follows. For readability, we use the standard FK (Kincaid et al., 1975) and ARI (Smith and Senter, 1967) scores, which use average word and sentence length to estimate the complexity of a text.

¹ Internal dataset.
² We set the system's role as "You are a helpful assistant that simplifies text", and the prompt as "Simplify this text:".
For factual consistency, we use BERTScore (Zhang et al., 2020) and GPT-Eval (Liu et al., 2023) (see Appendix D), as these correlate well with human judgment (Scialom et al., 2021; Li et al., 2022; Liu et al., 2023). For GPT-Eval, we evaluate 50 summaries and report the fraction of samples in which a factual inconsistency was found.
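Both readability metrics can be computed from surface statistics alone. The sketch below applies the standard published formulas with a naive regex tokenizer and a crude vowel-group syllable heuristic, so scores will differ slightly from reference implementations such as the `textstat` library.

```python
import re

def _syllables(word):
    # Count vowel groups as a rough proxy for syllables.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def flesch_kincaid(text):
    """FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

def ari(text):
    """ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43
```

Both formulas grow with word length (syllables for FK, characters for ARI) and with sentence length, which is why short phrases can score deceptively well, the caveat raised for NapSS in the results.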

Results
We fine-tune a BART model using our methods and present the results in Table 1; see Appendix A for implementation details.

Effect of Unlikelihood Loss and Decoding On Cochrane and Radiology, our proposed methods achieve better readability scores in terms of FK and ARI. In particular, combining the unlikelihood loss with the decoding strategy yields a 2.43/1.74-point improvement in FK/ARI over the next best model on Cochrane, and a 0.12/0.17-point improvement on Radiology. Note that in the Radiology dataset the sentences are typically short, resulting in a lower (better) baseline readability score. See a comparison of sample outputs in Appendix B.
On MedEasi, our methods slightly underperform NapSS (Lu et al., 2023). We find that NapSS sometimes generates phrases instead of full sentences, which lowers FK/ARI, since these scores depend on sentence length. In contrast, our models generate complete sentences, which improves fluency at the cost of worse (i.e., higher) FK/ARI scores.
Our methods generally improve over the prior SOTA in terms of SARI and BERTScore; interestingly, however, on the Radiology dataset all methods underperform a fine-tuned BART model.
We observe that using UL or the decoder individually results in fewer hallucinations than both BART-UL (Devaraj et al., 2021) and NapSS (Lu et al., 2023) on Radiology, and than NapSS on MedEasi. When the baseline models perform well, it is because they tend to copy information from the input and are hence less prone to hallucinations. In contrast, our strategies force the model to use simpler words rather than copy the input, but may introduce inconsistencies with the source. We confirmed this with an experiment: we computed the % 4-gram overlap of the model-written summaries with the source, and observed that large portions of previous works' outputs are copied from the text, whereas our models' outputs are not (see Table 3). Note that some of the identified hallucination errors are relatively minor, as we find GPT-Eval to be very strict. For example, the phrase "26 selftreatments of 26 Chinese herbal medicine prescriptions" is judged by GPT-Eval to be factually inconsistent with the source phrase "26 self concocted Chinese herbal compound prescriptions" (see Table 11 for the full example).
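The copy-rate analysis above can be reproduced with a simple n-gram overlap measure. The exact tokenization used in the paper is unspecified, so lowercased whitespace tokens are an assumption of this sketch.

```python
def four_gram_overlap(summary, source, n=4):
    """Fraction of the summary's n-grams that also appear verbatim in
    the source text; high values indicate copying rather than rewriting."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ = ngrams(summary)
    if not summ:
        return 0.0  # summary shorter than n tokens
    return len(summ & ngrams(source)) / len(summ)
```

A model that lifts whole clauses from the source scores near 1.0 on this measure, while a genuine rewrite in simpler words scores near 0, matching the contrast reported in Table 3.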
Human Evaluation We conduct a human evaluation study to further investigate these results (see Table 2). Our proposed UL and decoder improve readability over a fine-tuned BART-XSum model 43% and 27% of the time, respectively, whereas the previous SOTA, NapSS (Lu et al., 2023), demonstrates clear benefits only 3% of the time. However, GPT-4 achieves the best performance, mainly because it is trained on human preference data and omits minor details, keeping only the main summary. In contrast, our models and the previous SOTA tend to retain these minor details from the source, which human evaluators may find irrelevant.
We note that the low inter-rater agreement aligns with the ranges reported in previous work (Goyal et al., 2023), reflecting the subjective nature of human preference, given that simplicity and readability vary with one's technical background and style preferences. While such variability is hard to avoid, the average proportions suggest that overall, our methods significantly improved upon the previous SOTA (NapSS).

Effect of Individual Unlikelihood Losses
We test using UL_R and UL_C separately (see Table 4). UL_R alone results in good readability but poor factual consistency, and vice versa for UL_C, justifying the use of both losses in conjunction.

Conclusion
In this paper, we propose methods that improve simplicity in medical text simplification; they improve the readability of generated summaries while achieving comparable BERTScore and SARI scores. However, hallucination remains a challenge.

Model                              % 4-Gram Overlap
BART-XSum                          52.88%
BART-UL (Devaraj et al., 2021)     39.30%
NAPSS (Lu et al., 2023)            51.77%
UL (Ours)                          15.73%
Decoder (Ours)                      9.80%

We explored augmenting the data with external knowledge (see Appendix C.2), but found no benefit. This may be because the sources and labels in the training data contain inconsistencies (Lu et al., 2023), which require further preprocessing. Addressing such hallucinations to generate more robust summaries is a critical future direction in medical text summarization, which we aim to explore further.

Limitations
One limitation of our work is the persistence of hallucinations in the output. Previous literature has shown that these often originate from inconsistencies between the source and target texts. For example, a number of training labels in the Cochrane dataset (Devaraj et al., 2021) contain the phrase "The evidence is up to date as of X" despite no mention of a date in the source (Lu et al., 2023). To this end, future work can adapt strategies from the summarization literature, where preprocessing (Adams et al., 2022; Wu et al., 2022) and augmenting (Yang et al., 2023) the data have been shown to mitigate such hallucinations.
Another limitation is that our paper examines medical text simplification very broadly, whereas expert knowledge may be needed to improve specific tasks. Future work can therefore analyze such methods on a more specialized set of datasets (e.g., medical literature, patient reports, health-related news). Such work can also be extended to other languages, for which multiple medical text simplification datasets have been developed (Trienes et al., 2022; Grigonyte et al., 2014; Cardon and Grabar, 2019, 2020; Joseph et al., 2023a).
Finally, we note that our inter-annotator agreement on the readability task is particularly low; this reflects both the diversity of human preferences and the highly subjective nature of the task, as has been shown in other domains (Goyal et al., 2023). Moreover, readability differs not only by person, but also by domain and task. Future work can define domain-specific criteria and recruit participants from the exact target populations for whom the text is meant to be simplified.

Ethics Statement
We use publicly available datasets and make our preprocessing and training scripts available.As mentioned in the limitations section, both our methods and previous methods still exhibit varying degrees of hallucination, and have yet to undergo domain-specific examination.Hence, we do not recommend these models be applied in a practical setting at the moment.

A Implementation Details
We train a baseline BART-XSum model (Lewis et al., 2020) on Cochrane, MedEasi, and the Radiology Dataset.We implement the unlikelihood loss and modified decoder using the Transformers library (Wolf et al., 2020); we report the hyperparameters in Table 5.We run our experiments using NVIDIA-RTX 6000 GPUs.

B Example Output
Tables 6, 7, and 8 show comparisons of outputs from the previous SOTA and our model. We clearly observe the benefits of our methods: the writing is much simpler, and complex phrases such as "asthma exacerbation" and "emergency department presentation" have been replaced by "asthma attack" and "coming to the emergency department". Table 8 shows an instance in which the writing is much simpler, but the model retains much more information from the source and explains other concepts (in italics); this may come across as redundant to some evaluators, which explains the human evaluation results when compared to models like GPT-4.

C.1 Varying Candidate Set e
Kindly refer to Table 10 for a comparison of performance when we exclude only words in the input (UL_I), the label (UL_L), or both (UL_IL).

C.2 External Knowledge Augmentation
We identify complex terms using SciSpacy (Neumann et al., 2019), obtain their meanings from UMLS (Bodenreider, 2004), and insert these meanings in the sentence.An example is provided in Table 9. Kindly refer to Table 10 for a comparison of performance with and without external knowledge (EK).

D GPT-Eval Implementation
We follow Liu et al. (2023) in implementing GPT-Eval for factual consistency with GPT-4, as this was found to correlate best with human judgment. We use the gpt-4-0314 version, with n, top_p, and temperature set to 1. We set the system role as "Your task is to rate the summary on one metric." For the user input, we provide the following:

BART-UL
This review of 38 studies involving 7843 children found that asthma education aimed at children and their carers who present to the emergency department for acute exacerbations can result in lower risk of future emergency department presentation and hospital admission. There were also fewer unscheduled doctor visits. Very few data were available on other markers of asthma morbidity such as quality of life, symptoms and lung function. It remains unclear as to what type, duration and intensity of educational packages are the most effective in reducing acute care utilisation.

Ours
This is an overview of studies that looked at the long-term effects of providing education to children and their carers who have a flare-up of their asthma. The aim was to see if there was a benefit in terms of the number of children who had to go to hospital for an asthma attack. A total of 38 studies involving 7843 children were included in this review. The results showed that if children were taught about asthma, their parents or both, they were less likely to come to the emergency department for an attack. There was also less chance that they would need to go back to hospital. Very few data were available for other outcomes (e.g. quality of life, symptoms and lung function) and there was no difference between education and control. However, there is a need for more research in this area.

Input
Five trials of MSP/RESA vaccine with 217 participants were included; all five reported on safety, and two on efficacy. No severe or systemic adverse effects were reported at doses of 13 to 15 µg of each antigen (39 to 45 µg total). One small efficacy trial with 17 non-immune participants with blood-stage parasites showed no reduction or delay in parasite growth rates after artificial challenge. In the second efficacy trial in 120 children aged five to nine years in Papua New Guinea, episodes of clinical malaria were not reduced, but MSP/RESA significantly reduced parasite density only in children who had not been pretreated with an antimalarial drug (sulfadoxine-pyrimethamine). Infections with the 3D7 parasite subtype of MSP2 (the variant included in the vaccine) were reduced (RR 0.38, 95% CI 0.26 to 0.57; 719 participants) while those with the other main subtype, FC27, were not (720 participants). The MSP/RESA (Combination B) vaccine shows promise as a way to reduce the severity of malaria episodes, but the effect of the vaccine is MSP2 variant-specific. Pretreatment for malaria during a vaccine trial makes the results difficult to interpret, particularly with the relatively small sample sizes of early trials. The results show that blood-stage vaccines may play a role and merit further development.

NAPSS
Five trials of MSP/RESA vaccine with 217 participants were included; all five reported on safety, and two on efficacy. No severe or systemic adverse effects were reported at doses of 13 to 15 µg of each antigen (39 to 45 µg total). One small efficacy trial with 17 non-immune participants with blood-stage parasites showed no reduction or delay in parasite growth rates after artificial challenge. In the second efficacy trial in 120 children aged five to nine years in Papua New Guinea, episodes of clinical malaria were not reduced, but the vaccine significantly reduced parasite density only in children who had not been pretreated with an antimalarial drug (sulfadoxine-pyrimethamine). Infections with the 3D7 parasite subtype (the variant included in the vaccine) were reduced (RR 0.38, 95% CI 0.26 to 0.57; 719 participants) while those with the other main subtype, FC27, were not (720 participants). The effect of the vaccine is MSP2 variant-specific. Pretreatment for malaria during a vaccine trial makes the results difficult to interpret, particularly with the relatively small sample sizes of early trials.

BART-UL
This review of five trials found that the MSP/RESA (Combination B) vaccine shows promise as a way to reduce the severity of malaria episodes, but the effect of the vaccine is MSP2 variant-specific. The results show that blood-stage vaccines may play a role and merit further development.

Ours
The MSP/RESA (Combination B) vaccine shows promise as a way to reduce the severity of malaria episodes, but the effect of the vaccine is MSP2 variant-specific.
The results show that blood-stage vaccines may play a role and merit further development.

Figure 1: Training Diagram for Computing Unlikelihood Loss

Figure 2: Diagram for Modified Beam Search Decoding for Simplification

Table 2: Human Evaluation Results on 30 Examples from Cochrane. Readability is the % of instances where the model summary was strictly more readable than a fine-tuned BART-XSum model's summary; κ is Fleiss' kappa inter-rater agreement (Fleiss, 1971), α is Krippendorff's alpha (Passonneau, 2006).

Table 3: An analysis of the % 4-gram overlap between the source text and model outputs reveals that previous models tend to copy directly from the source text, whereas our models do not, thereby simplifying and synthesizing.

Table 5: Training Hyperparameters

PEF, rescue medication use, quality of life or symptoms) and there was no statistically significant difference between education and control. Asthma education aimed at children and their carers who present to the emergency department for acute exacerbations can result in lower risk of future emergency department presentation and hospital admission. There remains uncertainty as to the long-term effect of education on other markers of asthma morbidity such as quality of life, symptoms and lung function. It remains unclear as to what type, duration and intensity of educational packages are the most effective in reducing acute care utilisation.

NAPSS
A total of 38 studies involving 7843 children were included in this review. Asthma education aimed at children and their carers who present to the emergency department for acute exacerbations can result in lower risk of future emergency department presentation and hospital admission. There were also fewer unscheduled doctor visits. Very few data were available for other outcomes (FEV1, PEF, rescue medication use, quality of life or symptoms) and there was no statistically significant difference between education and control. It remains unclear as to what type, duration and intensity of educational packages are the most effective in reducing acute care utilisation. There remains uncertainty as to the long-term effect of education on other markers of asthma morbidity, symptoms and lung function.

Table 6: Sample Report 1 from the Cochrane Test Set

Table 7: Sample Report 2 from the Cochrane Test Set