Improving Summarization with Human Edits

Recent work has shown the promise of learning-from-human-feedback paradigms for producing text that humans judge to be of high quality. Existing work uses human feedback to train large language models (LLMs) on general domain abstractive summarization and has obtained summary quality exceeding traditional likelihood training. In this paper, we focus on a less explored form of human feedback -- Human Edits. We propose Sequence Alignment (un)Likelihood Training (SALT), a novel technique that uses both human-edited and model-generated data together in the training loop. In addition, we demonstrate simulating Human Edits with ground truth summaries from existing training data -- Imitation Edits -- along with model-generated summaries obtained after training, to reduce the need for expensive human-edit data. In our experiments, we extend human feedback exploration from general domain summarization to medical domain summarization. Our results demonstrate the effectiveness of SALT in improving summary quality with Human and Imitation Edits. Through additional experiments, we show that SALT outperforms the conventional RLHF method designed for human preferences -- DPO -- when applied to human-edit data. We hope the evidence in our paper prompts researchers to explore, collect, and better use different human feedback approaches scalably.


Introduction
Large-scale language model pretraining has become increasingly prevalent for achieving high performance on various natural language processing (NLP) tasks (Brown et al., 2020; Sanh et al., 2021; Chowdhery et al., 2022; Longpre et al., 2023; OpenAI, 2023; Cai et al., 2023). When applying these models to a specific task, they are usually finetuned to maximize the likelihood of human-written text. While this strategy has led to markedly improved performance on many metrics, models still cannot consistently produce human-determined high-quality output. The NLP community has pointed out some key drawbacks of traditional finetuning. First, important errors (e.g., hallucinations) and unimportant errors (e.g., minor grammar errors) contribute equally to the final loss. Second, the model weighs the loss equally on all labeled data regardless of type, quality, and difficulty. Third, distribution shifts in new data degrade performance (catastrophic forgetting) (Kirkpatrick et al., 2017).

[Table 1 excerpt -- ground truth summary, DR: "Plus ribavirin, roughly based on your weight. Like 3 pills in the morning, 3 pills in the evening."]

Some works tackle these problems with human feedback (HF). Specifically, they fine-tune language models with HF using reward learning (Stiennon et al., 2020; Ziegler et al., 2019). With a large amount of HF data, these works demonstrate that large-scale LMs, such as GPT-3 (Brown et al., 2020), can reach a text generation quality exceeding traditional likelihood training. However, the acquisition cost of large-scale HF is high, and whether smaller LMs can also benefit is not fully studied. In addition, because LLMs are often provided in the form of third-party APIs and are too large for many companies' and labs' infrastructure to host, smaller models (e.g., the T5 family (Raffel et al., 2020)) still play important roles in many domains (e.g., medical), where privacy issues and pragmatic economics dominate decision-making strategies.
Our goal in this paper is to explore methods to train language models to improve summary quality with HF inexpensively. HF for summarization can come in different forms. One is to obtain human scores for summaries: previous work (Stiennon et al., 2020; Ziegler et al., 2019) focuses on training a reward function on HF data and using such rewards as training objectives by comparing different summaries' scores. More recently, this has been used by generative AI systems (e.g., ChatGPT and GPT4 (Ouyang et al., 2022; OpenAI, 2023)) under the name RLHF. Another form of HF is obtaining edits that make the summary correct. This second approach is a natural way to collect feedback in workflows where users work off of an AI-generated summary. For example, the summaries S_E in Table 1 are the results of clinicians/scribes modifying our AI-generated EHR summaries S_AI. In addition, the second approach might be more data-efficient in improving summarization models than the first, as it conveys more granular information than a single score for the entire summary. Human Edits from the second approach can also be converted to scores with simple rules, like the percentage of edits, although this has not been studied extensively. Hence, from an ML data point of view, the second approach has certain unique advantages. Furthermore, large-scale expert feedback is hard to get through the annotation workflows used in RLHF, considering the expert's time, cost, and willingness; Human Edits, which can be obtained from users applying AI summaries in their work, may be a more reasonable alternative in various professional-knowledge-intensive domains.
We explore how to use Human Edits to improve summary quality. In addition to general domain summarization, we also focus on a medical domain summarization task, automatic clinical note generation from doctor-patient conversations, which is understudied due to privacy and data inaccessibility problems. Table 1 provides an example of a Clinician Conversation from our dataset (CC). We present our work through two experiments on a novel technique, Sequence Alignment (un)Likelihood Training (SALT), which uses Human Edits and unlikelihood objectives together with the standard likelihood training paradigm to improve summary quality. Unlikelihood training was proposed to reduce the probability of unlikely tokens predicted by models (Welleck et al., 2019).
In our first experiment, we use Human Edits from physicians editing AI-generated clinical summaries of medical conversations to improve the summarization models. In our second, we explore how to get similar benefits from pre-existing ground-truth human summaries that were not written as edits to the AI-generated summaries, which we call Imitation Edits. We refer to the AI-generated summary as S_AI, the human-edited summary as S_E, and the imitation-edit summary as S_I. We show how the unlikelihood objective can be generalized to improve summary quality with both (S_AI, S_E) and (S_AI, S_I) pairs. In addition, our results show that SALT stably improves summary quality for T5 (small and large) summarization models with Human and Imitation Edits. Further experiments show how SALT can address the catastrophic forgetting problem arising from the distribution difference between S_AI and S_E with the help of RSALT, an improved version of the Replay-based methods in Continual Learning (Rebuffi et al., 2017).
Finally, to compare SALT and RLHF, we experiment with SALT and Direct Preference Optimization (DPO) (Rafailov et al., 2023) on human edit data and demonstrate the superiority of SALT on this type of human feedback.
Due to space constraints, we have relegated specific content to the appendix. In Appendix A.1 and A.2, we provide definitions of the SOAP structure and implementation details. In Appendix A.3, we focus on using Imitation Edits and SALT for training on publicly available datasets, accompanied by experimental results. Lastly, in Appendix A.4, we further discuss the relation between SALT and various other RLHF methods. In summary, our contributions are as follows:
• To our knowledge, we are the first to extend current HF trends in summarization research to the automatic clinical note-generation task.
• Different from the form of HF used in previous work, we explore Human Edits to improve summary quality in this paper.
• We show that SALT extends unlikelihood training into a general framework using sequence alignment, and we further combine SALT and Replay-based methods (Rebuffi et al., 2017) into RSALT to tackle catastrophic forgetting.
• Finally, we show that SALT achieves better performance than DPO on human-edit feedback.

Related Work
Most directly related to our work is research on automatic clinical note generation from doctor-patient conversations (Schloss and Konam, 2020; Ramprasad et al., 2023; Krishna et al., 2020; Abacha et al., 2023a; Ben Abacha et al., 2023; Yim et al., 2023; Wang et al., 2023). The difference is that those works focus on training a summarization model with pre-labeled data, while we focus on using HF to further improve the summary quality of trained models.
Previous work used HF to train summarization models with reinforcement learning (RL) (Böhm et al., 2019; Ziegler et al., 2019; Stiennon et al., 2020), using GPT-2 and GPT-3 to optimize for HF across various summarization tasks. These RL-based methods focus on training a reward function on HF data and using such rewards as training objectives by comparing different summaries (RLHF). Recently, some RLHF variants collect or use rewards more flexibly and stably (Akyürek et al., 2023; Dong et al., 2023; Zhao et al., 2023; Yuan et al., 2023). We instead introduce unlikelihood training as an additional learning objective in supervised learning. Our technique aims to decrease the probability of unlikely sequences, defined as those that appear in S_AI but not in S_E, and to increase the probability of verified sequences, which are in S_AI and reinforced by S_E, as well as novel sequences, which do not appear in S_AI but do appear in S_E.
Unlikelihood training (Welleck et al., 2019) adds an unlikelihood loss to lower the probability of negative candidates. Previous work has explored many kinds of negative candidates for unlikelihood training, including style transfer (Devaraj et al., 2021); repetition, copying, and contradictions (Li et al., 2019); factuality (Cao and Wang, 2021); text degeneration (Su et al., 2022); and clinical summarization (Adams et al., 2022). In this work, we align S_E with S_AI to identify negative candidates and train different tokens with unlikelihood and likelihood losses. We also show that our experiments on Human Edits extend to Imitation Edits, reducing the need for HF data, which can be expensive to obtain.

Clinician Conversations (CC) Dataset
This dataset is a collection of 63,000 consented, de-identified doctor-patient conversations with human transcripts, with an average duration of 9 minutes. We segmented the dataset to create training, validation, and test sets of 52,000, 5,000, and 6,000 files each while controlling important characteristics of the distribution in each split. The transcripts of the conversations were annotated according to the traditional SOAP format. A SOAP note can contain numerous observations that are grounded to shorter excerpts from the transcript via timestamps that relate back to the original audio. There are several sections and subsections in the SOAP structure, each of which needs specific information and is written in a different format. Table 2 shows that the average lengths of the different subsections span a large range.

CCUser Dataset
To generate SOAP notes from doctor-patient conversations, our pipeline follows (Ramprasad et al., 2023; Krishna et al., 2020). We first record the clinical conversation, then transcribe it either with humans or with Google's medical-conversations Automatic Speech Recognition (ASR) service. Then, using our proprietary models, we classify utterances into SOAP sections. Finally, using our section-conditioned summarization model trained on the CC dataset, we generate summaries for each utterance cluster belonging to each section. We use this pipeline to produce SOAP summaries for our clinician users, who record their conversations with their patients via a mobile app. The generated summaries were edited by scribes and doctors using our dashboard for their documentation tasks. The dashboard is built for doctors and scribes to quickly check and fix AI-generated summaries in their regular workflow. Hence, we didn't enforce any training or instructions that might have made the data more useful for research, and the users were free to use the dashboard as they saw fit.
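As a rough illustration, the pipeline above can be sketched as a composition of stages. All the callables here are hypothetical stand-ins for the ASR service, section classifier, and section-conditioned summarizer described above, not the actual proprietary models:

```python
def generate_soap_note(audio, transcribe, classify_section, summarize):
    """Sketch of the note-generation pipeline: transcribe, cluster
    utterances by SOAP section, then summarize each cluster.

    All four callables are hypothetical stand-ins for proprietary models.
    """
    utterances = transcribe(audio)  # ASR or human transcript
    clusters = {}
    for utt in utterances:
        clusters.setdefault(classify_section(utt), []).append(utt)
    # one section-conditioned summary per utterance cluster
    return {sec: summarize(sec, utts) for sec, utts in clusters.items()}
```

With toy stand-ins (e.g., an identity "transcriber" and a keyword "classifier"), the function returns one summary per SOAP section.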
The distribution of the CCUser dataset differs from the CC dataset in the following ways. First, CC uses human-written transcripts as training inputs, while CCUser uses our pipeline's ASR transcripts rather than human-tagged utterances. Second, the average conversation length is 20 minutes for CCUser compared to 9 minutes for CC, which could mean more complex conversations. The dataset has 215 ASR transcripts with AI-generated notes (along with the Human Edits) from 10 physicians. We randomly select 70 notes from 7 physicians (10 per physician) as a training dataset and divide the remaining 145 notes into evaluation and test sets. Finally, our dataset splits into train:eval:test = 1279:1457:1458 (utterance cluster, edited summary, AI summary) triplets.

Methods
Given a tokenized utterance cluster U as input, the summarization model M generates a summary S_AI = [x_1, x_2, ..., x_lenS_AI]. The user edits this summary from S_AI to S_E, where S_E = [y_1, y_2, ..., y_lenS_E]. We aim to update the parameters of M based on both S_AI and S_E. Let lenU, lenS_AI, and lenS_E be the number of tokens in U, S_AI, and S_E, respectively.

Sequence Alignment (un)Likelihood Training (SALT) using S_AI and S_E

When a user edits a summary from S_AI to S_E, they can modify or delete a span of tokens, insert a new span of tokens, or leave a span of tokens unchanged. We want to use these Human Edits to improve our summarization models so that they produce outputs closer to the user's modified summary than before. We do this by using both S_AI and S_E in training. We train the model to:
(i) lower the probability of producing words that the user deleted or modified in S_AI;
(ii) reinforce the probability of producing words that the user did not change in S_AI and that are retained in S_E;
(iii) increase the probability of producing words that the user newly added in S_E.

The loss functions to train the summarization model with S_AI and S_E are:

L_{S_AI} = \sum_{t=1}^{lenS_AI} [ 1_{AI-C}(t) \cdot w_{AI-C} \cdot L_p(x, t) + 1_{AI-NC}(t) \cdot w_{AI-NC} \cdot L_r(x, t) ]    (1)

L_{S_E} = \sum_{t=1}^{lenS_E} [ 1_{E-C}(t) \cdot w_{E-C} \cdot L_r(y, t) + 1_{E-NC}(t) \cdot w_{E-NC} \cdot L_r(y, t) ]    (2)

L_r(x, t) = -\log p(x_t | x_{<t}, U)    (3)

L_p(x, t) = -\log(1 - p(x_t | x_{<t}, U))    (4)

Where:
1. U is the utterance cluster used as input.
2. C and NC mean "changed" and "not changed" tokens when we align the S_AI and S_E sequences.
3. 1_{AI-C}(t) and 1_{AI-NC}(t) are indicator functions signifying whether the token x_t in S_AI is changed or not changed by the user. Similarly, 1_{E-C}(t) and 1_{E-NC}(t) indicate whether the token y_t in S_E is changed or not changed relative to S_AI.
4. w_x are the loss weights; for example, w_{AI-C} is the weight penalizing tokens that are in S_AI but not in S_E.
5. L_r(x, t) and L_p(x, t) are the likelihood and unlikelihood loss functions, defined in Equations 3 and 4.

The losses L_{S_AI} and L_{S_E} over the (S_AI, S_E) pair are used together to train the summarization model. The indicator functions in the above equations can be found either by tracking the user's changes as they edit the summary or by aligning S_E to S_AI with a sequence alignment algorithm. We use sequence alignment (the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970)) in this work because our dashboard does not log the users' keystrokes. Assume we have S_AI "patient takes one aspirin daily" and the corresponding S_E "patient doesn't want to take aspirin". We can align these two sentences as below:

S_AI: patient takes   one  -  -    aspirin daily
S_E:  patient doesn't want to take aspirin -
Ops:  C       S       S    I  I    C       D

where "C" is "Correspondence" (matching), "I" is "Inserted", "D" is "Deleted", and "S" is "Substituted". Note that we do this at the token level in the implementation. For the S_AI word list ["patient", "takes", "one", "aspirin", "daily"], the corresponding indicator functions in Equation 1 are 1_{AI-C} = [0, 1, 1, 0, 1] and 1_{AI-NC} = [1, 0, 0, 1, 0]. For the S_E word list ["patient", "doesn't", "want", "to", "take", "aspirin"], the corresponding indicator functions in Equation 2 are 1_{E-C} = [0, 1, 1, 1, 1, 0] and 1_{E-NC} = [1, 0, 0, 0, 0, 1].

Imitation Edits

S_E is a special kind of ground truth summary from the user: S_E is obtained by the user from U and S_AI, i.e., S_E = Fn(U, S_AI). An interesting question is whether we can approximate the edited summary with S_I (Imitation Edits) and use it with SALT to improve the models in the absence of actual Human Edits. In our work, we use the pre-existing ground-truth summaries as S_I even though they were not explicitly written as edits to S_AI. Leveraging such data has several advantages. First, since S_E is not easy to obtain, approximating S_E with S_I can increase the amount of data available for unlikelihood training, and we can use SALT even without human-edit data or any new annotations. Second, provided the Imitation Edits are of high quality, combining Human Edits and Imitation Edits can further improve the model's performance, since both bring effective data points for training. Third, Imitation Edits can be used to address the forgetting problem that arises when we do SALT training with S_AI and S_E; we show this in the next section.
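As an illustrative sketch of the alignment-to-loss mechanics above: the paper uses the Needleman-Wunsch algorithm, but Python's built-in difflib produces equivalent C/S/I/D-style opcodes for this short example; salt_loss is a minimal token-level rendering of the weighted (un)likelihood terms, not the actual training code.

```python
import math
from difflib import SequenceMatcher

def edit_indicators(s_ai, s_e):
    """Align two token lists; return binary changed-indicators for each.

    An S_AI token is "changed" (1_{AI-C}) if substituted or deleted;
    an S_E token is "changed" (1_{E-C}) if substituted or inserted.
    difflib stands in here for Needleman-Wunsch alignment.
    """
    ai_changed = [0] * len(s_ai)
    e_changed = [0] * len(s_e)
    for op, i1, i2, j1, j2 in SequenceMatcher(a=s_ai, b=s_e).get_opcodes():
        if op != "equal":  # 'replace', 'delete', or 'insert'
            for i in range(i1, i2):
                ai_changed[i] = 1
            for j in range(j1, j2):
                e_changed[j] = 1
    return ai_changed, e_changed

def salt_loss(token_logprobs, changed, w_c, w_nc, unlikelihood_on_changed):
    """Weighted per-token SALT loss for one sequence.

    token_logprobs[t] = log p(token_t | prefix, U). For S_AI, changed
    tokens take the unlikelihood term -log(1 - p); for S_E, all tokens
    keep the likelihood term -log p, with different weights.
    """
    loss = 0.0
    for lp, c in zip(token_logprobs, changed):
        if c and unlikelihood_on_changed:
            loss += w_c * -math.log(1.0 - math.exp(lp))  # push prob down
        elif c:
            loss += w_c * -lp                            # push prob up
        else:
            loss += w_nc * -lp                           # reinforce
    return loss

s_ai = ["patient", "takes", "one", "aspirin", "daily"]
s_e = ["patient", "doesn't", "want", "to", "take", "aspirin"]
ai_c, e_c = edit_indicators(s_ai, s_e)
# ai_c == [0, 1, 1, 0, 1]; e_c == [0, 1, 1, 1, 1, 0]
```

The same indicator machinery applies unchanged when S_I replaces S_E for Imitation Edits.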
To imitate Human Edits, we assume the original ground truth summary is generated from S_AI and its utterance cluster U (even though the ground truth notes were written independently). Similar to the setting with S_AI and S_E above, we use the alignment algorithm to align S_AI and S_I. Then we calculate L_{S_I}:

L_{S_I} = \sum_{t=1}^{lenS_I} [ 1_{I-C}(t) \cdot w_{I-C} \cdot L_r(x, t) + 1_{I-NC}(t) \cdot w_{I-NC} \cdot L_r(x, t) ]    (5)

where 1_{I-C}(t) and 1_{I-NC}(t) signify whether the token x_t in S_I is changed or not changed compared to S_AI, and w_x are the loss weights.

Replay-based SALT (RSALT) for Catastrophic Forgetting Problem
We continue training the model M that has converged on the original summarization dataset (e.g., CC) on the Human Edits dataset (e.g., CCUser) to improve summary quality, which subjects the model to catastrophic forgetting because of the distribution differences between the two. We use traditional Replay-based methods (Rebuffi et al., 2017), which sample a part of the data from the seen dataset (e.g., CC) and add it to the unseen data (e.g., CCUser), to address the catastrophic forgetting problem. Here, the likelihood loss is calculated for both the sampled seen data S_I(seen) and the human-edit data S_E(unseen), with the loss function L = MLE_{S_I(seen)} + MLE_{S_E(unseen)}, where MLE denotes Maximum Likelihood Estimation.
Following Section 4.1, we can use both S_AI(unseen) and S_E(unseen) to do SALT training. Following Section 4.2, for the sampled previously seen data, we can also form (S_AI(seen), S_I(seen)) pairs and do SALT training. According to Equations 1, 2, and 5, the loss function with RSALT is

L_RSALT = L_{S_AI(unseen)} + L_{S_E(unseen)} + L_{S_AI(seen)} + L_{S_I(seen)}

Metrics

ROUGE and UMLS-F1: Models are evaluated with full-length F1 scores of ROUGE (Lin, 2004). We use QuickUMLS to extract medical concepts from both model-generated and ground truth summaries and then calculate F1 scores for these two lists of concepts, a metric named UMLS-F1 (Adams et al., 2023; Ramprasad et al., 2023).
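A sketch of the UMLS-F1 computation, assuming concept extraction (e.g., via QuickUMLS) has already produced a concept list for each summary:

```python
def concept_f1(pred_concepts, ref_concepts):
    """F1 over two lists of extracted concepts (UMLS-F1 style).

    Duplicates are ignored; the concept extraction step itself
    (e.g., QuickUMLS) is assumed to have happened upstream.
    """
    pred, ref = set(pred_concepts), set(ref_concepts)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)  # concepts present in both summaries
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, with one shared concept out of two on each side, precision and recall are both 0.5, so UMLS-F1 is 0.5.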
GPT4 & Human preference: Recent work shows a higher correlation of GPT4 evaluation with human judgment than traditional metrics (Moramarco et al., 2022; Gao et al., 2023; Fu et al., 2023), so we also use GPT4 preference as a measurement of summary quality. Specifically, we instruct GPT4 to produce a preference ranking over different AI-generated summaries based on the conversation snippet and reference summary. Similarly, we asked 2 medical students to rate summaries from CC based on the same information; for privacy reasons, we did not evaluate CCUser with humans. We discuss the Mean Reciprocal Rank (MRR) (Radev et al., 2002) of different models in Section 6.4. Generally, a higher MRR value implies that evaluators prefer an approach more.
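MRR itself is simple: for each evaluated item, take the reciprocal of the rank an evaluator assigned to a system, and average. A minimal sketch:

```python
def mean_reciprocal_rank(ranks):
    """MRR over the 1-indexed ranks an evaluator assigned to one system."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

A system ranked first on every item gets MRR 1.0; lower preference yields smaller values.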
SAGE: ROUGE and UMLS-F1 measure the degree of "likelihood," i.e., they evaluate whether the model can generate something closer to some reference. However, we don't just want to know how much "closer to S_E" a newly generated summary S_new is, but also how "far away from the bad part of S_AI" it is, i.e., the spans changed by the Human Edits. To address this, we design an evaluation method that measures how likely machines are to repeat their earlier mistakes and how likely they are to generate summaries more like the target user's (as identified during the editing process). We group words into G_{w1(AI-E)} (words in S_AI but not S_E, i.e., content the user removed), G_{w2(E-AI)} (words in S_E but not S_AI, i.e., content the user added), and G_{w3(AI∩E)} (words in both, i.e., content the user verified). By training on HF, we aim to have S_new closer to S_E while avoiding the mistakes found in S_AI, so SAGE counts how many words in S_new fall in G_{w1(AI-E)}, G_{w2(E-AI)}, and G_{w3(AI∩E)}. We call this word-level SAGE (SAGE_w). Similarly, we define concept-level SAGE (SAGE_c) based on UMLS concept overlap among S_new, S_AI, and S_E.
We have two assumptions regarding SAGE: 1. users can accept machines making some mistakes, but they cannot tolerate machines making the same mistake again and again; 2. users will be more satisfied if the model, over time, learns to generate outputs more similar to the user's edited summaries. According to Assumptions 1 and 2, a model trained on HF should generate less content belonging to G_1 (G_w1 and G_c1) and more content belonging to G_2 (G_w2 and G_c2). The model should also still generate G_3 (G_w3 and G_c3), since G_3 represents human-verified information.
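A minimal set-based sketch of the word-level SAGE counts (the paper's exact counting may differ; this only illustrates the three groups):

```python
def sage_w(s_new, s_ai, s_e):
    """Word-level SAGE counts (illustrative, set-based).

    G1 = words the user removed (in S_AI, not S_E): want fewer of these.
    G2 = words the user added (in S_E, not S_AI): want more of these.
    G3 = words the user kept (in both): verified, keep generating them.
    Returns how many S_new words fall into each group.
    """
    new, ai, e = set(s_new), set(s_ai), set(s_e)
    g1, g2, g3 = ai - e, e - ai, ai & e
    return len(new & g1), len(new & g2), len(new & g3)
```

For the running alignment example, a new summary "patient takes aspirin" repeats one removed word ("takes"), adds none of the user's new words, and keeps two verified words ("patient", "aspirin").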

Experiments
We use the following symbols: 1. M refers to models that are trained and al-

Reducing the forgetting problem
In Table 3, we see a dip in evaluation metrics for SALT_l on the old evaluation dataset CC_eval when we continue training the model on CCUser -- catastrophic forgetting. The reason could be the distribution difference between the CCUser and CC datasets described in Section 3.2. Both SALT_u and SALT_{l+u} show different degrees of improvement in ROUGE-1 and UMLS-F1 on CC_eval data. This result shows that SALT training also alleviates the forgetting problem to a certain extent.
One widely used and effective technique to reduce catastrophic forgetting is the replay-based method, which mixes in seen data the model was trained on (e.g., CC). In this work, we set the ratio of CCUser to CC data to 2:1; that is, for n CCUser examples, we sample 0.5 * n CC examples to train together. Table 3 shows that SALT_x + RSALT_l is effective in helping the model reduce catastrophic forgetting. Adding the sampled seen data improves the model's performance on both the new (CCUser) and the original (CC) data. However, we still see a reduction in the performance of SALT_x + RSALT_l on the CC dataset compared with M, which shows that the traditional replay-based method cannot completely solve this problem. In Section 6.3, we show how we address the problem further with SALT, imitation-edit data, and RSALT.
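The 2:1 replay mix above can be sketched as follows (illustrative only; real training would mix at the batch level):

```python
import random

def build_replay_mix(unseen, seen, ratio=0.5, seed=0):
    """Mix human-edit (unseen) data with sampled seen data for replay.

    With ratio=0.5, n unseen examples are joined by 0.5 * n seen
    examples, giving the 2:1 CCUser:CC mix used in the paper.
    """
    rng = random.Random(seed)
    k = int(len(unseen) * ratio)
    mix = list(unseen) + rng.sample(list(seen), k)
    rng.shuffle(mix)  # interleave seen and unseen examples
    return mix
```

For 10 unseen examples, 5 seen examples are sampled, yielding a 15-example training mix.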

SALT in imitation-edit dataset
SALT uses the relationship between S_E and S_AI to achieve better performance than likelihood training on S_E alone. In this section, we show that we can directly improve the summarization model M using a similar relationship between S_I (the ground truth data) and S_AI, without new human-edit data or additional annotation, i.e., by assuming that S_I is the result of human edits on S_AI. Simulating Human Edits this way lets us 1) demonstrate the effectiveness of SALT on public datasets that have no human-edit component, and 2) reduce the amount of Human Edits needed, as they are hard to get.
Although both come from humans, S_E and S_I are fundamentally different in their relationship to S_AI: the former is modified from S_AI, while humans generate the latter from scratch. Therefore, S_E is directly dependent on S_AI, but S_I is not. Consequently, even though S_E and S_I are conditioned on the same input data, the differences between S_AI and S_I are likely to be larger than those between S_AI and S_E. We can see this in the average percentage of tokens for which 1_{E-C} or 1_{I-C} is 1: the former (6.17%) is much lower than the latter (45.59%). Hence, after we do sequence alignment between S_I and S_AI, we perform a two-step post-processing operation (detailed in Appendix A.3.1) to ensure training stability, which reduces the percentage of changed tokens from 45.59% to 19.07% with an acceptable amount of data lost (21.38%).
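A simplified stand-in for this post-processing: keep only (S_AI, S_I) pairs whose changed-token rate falls under a threshold. The actual two-step operation is described in Appendix A.3.1; the threshold here is illustrative.

```python
def filter_by_change_rate(pairs, max_changed=0.35):
    """Keep pairs whose fraction of changed tokens is small enough.

    `pairs` holds (tokens, changed_indicators) tuples; a simplified
    stand-in for the two-step post-processing in Appendix A.3.1.
    """
    kept = []
    for tokens, changed in pairs:
        if sum(changed) / max(len(tokens), 1) <= max_changed:
            kept.append((tokens, changed))
    return kept
```

A pair with a 25% change rate survives a 35% threshold; a fully rewritten pair is dropped, trading some data for training stability.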

Imitation Edits using seen data
We use the training data from CC to experiment with the effects of SALT and Imitation Edits on seen data. First, for the CC dataset, the results in Table 5 show that continuing to use likelihood loss on the training dataset to train the already-convergent M does not improve performance and leads to overfitting. However, when we use S_I as imitation-edit data and do SALT training on it with S_AI, we see an improvement. Second, we see similar results for the CNN dataset: even though there is no performance degradation from overfitting for SALT_l, doing SALT training with S_I and S_AI improves performance more than likelihood training alone. These results show that we can get additional improvement by continuing to train the model with SALT on the seen dataset even if the model has already converged on the seen/original training data. Third, unlike the previous human-edit results, SALT_u is better than SALT_{l+u} on CC. We think this is because M has started to overfit on the CC data, so continuing to add likelihood loss on the original training data reduces the scores.

Imitation Edits using unseen data
We use a part of the test dataset (not used in the evaluation) from CC to experiment with the effects of SALT and Imitation Edits on unseen data. In Table 6, we take M (trained on CC-train) and train it with a part of CC-test as the imitation-edit data with SALT. We use the remaining CC-test data to evaluate the model on new imitation-edit data, and CC-eval to evaluate the model on the original data. In the imitation-edit evaluation results (CC_test-r) of Table 6, SALT_{l+u} performs better than the baseline SALT_l, consistent with our results using human-edit data in Table 3. In the original data evaluation results (CC_eval) of Table 6, although there was no forgetting problem arising from distribution shift, SALT_{l+u} still scores higher than the baseline SALT_l.

Solving forgetting problem with RSALT
Through the previous analysis, we see that SALT helps M continue training on human-edit or imitation-edit data. In Sections 6.1.2 and 6.2.2, we observed that the traditional replay-based method cannot completely solve the catastrophic forgetting problem, so the performance of SALT_x + RSALT_l in Tables 3 and 6 is still lower than M's performance when there are distribution differences.
We report the results of SALT_x + RSALT_{l+u} in Tables 3 and 6. We find that SALT_x + RSALT_{l+u} does not suffer from the forgetting problem when continuing to train with human-edit data. We attribute this to the data augmentation that RSALT brings to the traditional replay-based method: RSALT not only reuses the seen data to prevent the model from forgetting the learned distribution but also uses the output generated by the model itself with SALT to further expand the effective training data points.

Preference Evaluation
On the CC dataset, GPT4 (on 500 data points) ranks SALT_{l+u} + RSALT_{l+u} higher than the other variations (SALT_l and SALT_{l+u}) and M. To verify the GPT4 ranking, we performed a human evaluation on a smaller set (25 data points); the human ranking agrees with the GPT4 ranking. On CCUser, GPT4 (on 500 data points) ranks SALT_{l+u} higher than the other variations, which is expected, as SALT_{l+u} + RSALT_{l+u} is also trained on the replay dataset. For privacy reasons, we did not do a human evaluation on CCUser. In Appendix Table 12, we show the prompt used with GPT4 for ranking the summaries. We show the MRR scores for all models in Figure 1.
Discussion: SALT vs RLHF

First, we argue that Human Edits are a more natural way to collect feedback from users, since they fix AI-generated text as part of their workflow anyway. Collecting other forms of feedback that are not directly tied to the user's workflow will not scale as well; this is especially true in domains requiring expert knowledge and with nuanced user goals. Considering the cost, time, and availability of experts, it is important to collect HF from the expert's daily workflow.
Second, we experiment with Direct Preference Optimization (DPO) (Rafailov et al., 2023) to compare RLHF and SALT on a human-edit feedback dataset. The training setups of DPO and SALT are similar: both train directly on the human preference data without training explicit reward models. We use S_AI as the rejected summary and S_E as the chosen summary and calculate the DPO loss L_DPO between them to train the model:
L_DPO = -\log \sigma( \beta \log \frac{\pi_\theta(S_E | U)}{\pi_{ref}(S_E | U)} - \beta \log \frac{\pi_\theta(S_AI | U)}{\pi_{ref}(S_AI | U)} )

where θ and ref are the current and original model parameters. We report results for β ∈ {0.1, 0.5} on GPT-2 (117M parameters), with ROUGE, Meteor, and Reward Accuracy (Reward Acc) on the CCUser test dataset. Reward Accuracy, as used in the DPO work, is the ratio of data points for which the chosen reward is greater than the rejected reward.
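A per-pair sketch of this loss, with the sequence log-probabilities assumed precomputed under the current policy (θ) and the frozen reference (ref); this mirrors the DPO objective rather than the authors' training code:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen = S_E, rejected = S_AI) pair.

    logp_* are summed sequence log-probabilities under the current
    policy; ref_logp_* under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin), written as a stable-enough closed form
    return math.log(1.0 + math.exp(-margin))
```

When the policy already favors the chosen summary over the rejected one (relative to the reference), the margin is positive and the loss approaches zero; at zero margin the loss is log 2.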
We find that DPO is better than SALT_l, which is equivalent to likelihood training on S_E alone. This is expected, since DPO also uses S_AI. However, DPO performs worse than SALT_{l+u}. When we change the hyper-parameter β to get higher Reward Accuracy, the other metrics (ROUGE and Meteor) degrade, and vice versa. We think this is because DPO penalizes the entire rejected summary, which is not suitable for human-edit feedback, where most words in S_AI and S_E are the same. DPO does not explicitly consider such cases, so it may be difficult for DPO to learn an implicit reward from S_AI and S_E without using the fine-grained relationship between their tokens. Interestingly, Reward Accuracy is higher for SALT than for DPO, even though the SALT loss does not explicitly maximize the chosen-vs-rejected log-probability margin as DPO does.
It should be noted that DPO was developed for preference comparisons, not human-edit feedback. For human-edit feedback, a straightforward way to improve DPO could be to modify its loss to use only the "negative tokens" in the rejected summary, which aligns with our SALT ideas.

Conclusion
In this work, we explore improving language models with Human Edits feedback, which can be collected more scalably than other forms of HF. Specifically, we propose the SALT training objective, based on sequence alignment and unlikelihood training, and show how to design Imitation Edits to reduce the need for expensive HF. We further show that, on human-edit data, SALT performs better than a straightforward RLHF approach (DPO).

Limitations
In our experiments, we find that our method improves relatively small language models like T5. Due to limited computational resources, we are not able to try our methods on larger language models, so we do not yet know which form of HF (human preferences or human-edit data) is better for LLMs. But as discussed in Section 1, Human Edits have many unique advantages from an ML data point of view. Given that they are a natural way to collect feedback from users as they fix AI-generated summaries in their workflow, many products in industry can use this HF approach and our SALT method to improve their text generation quality without much extra effort. In addition, other HF methods should be explored across domains and model sizes to help the NLP community find the most suitable HF method for each scenario.
Another direction not explored in this paper is LLM-in-the-loop. With the emergence of GPT3.5 and ChatGPT, LLMs have shown performance close to or beyond humans in many domains. In this paper, we did not use LLMs to conduct experiments similar to Human Edits (that is, treating the LLM as a human that modifies S_AI to get S_E(LLM)). Ideally, this would provide better Imitation Edits and reduce HF costs. Beyond time and resource constraints, as we discussed in Section 1, data privacy issues make it hard for many practitioners in industry to send their data to third-party APIs or service websites for such experiments. LLM-in-the-loop is a worthwhile next step, and we will study how to deal with the related data privacy issues, a problem shared by many other tasks in medical and other privacy-sensitive domains.
The current implementation of our methods also has room for improvement. Our code currently tries only one global sequence alignment algorithm, the Needleman-Wunsch algorithm. In fact, there are many alternatives that can help the model improve in different aspects. For example, improving the factuality of LM-generated summaries is a key topic for both the NLP and BioNLP communities (Tang et al., 2022; Abacha et al., 2023b; Chang et al., 2023). Previous work on language models and knowledge has shown that insufficient knowledge may lead to factual errors (Petroni et al., 2019; Sung et al., 2021; Yao et al., 2022a,b). We could therefore limit the scope of sequence alignment to medical entities (Luo et al., 2022) or jargon (Kwon et al., 2022) to help the model focus on important tokens during training and further reduce hallucination.
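As a concrete illustration of the alignment step, the following is a minimal sketch of Needleman-Wunsch global alignment over token sequences. The scoring values (match, mismatch, gap) and the example sentences are illustrative assumptions, not the configuration used in our implementation:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two token lists; '-' marks a gap.
    Scores are illustrative assumptions, not the paper's settings."""
    n, m = len(a), len(b)
    # DP table: best alignment score for prefixes a[:i], b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back to recover the aligned token pairs.
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], "-")); i -= 1
        else:
            aligned.append(("-", b[j - 1])); j -= 1
    return aligned[::-1]

s_ai = "patient denies chest pain".split()
s_e = "patient reports mild chest pain".split()
print(needleman_wunsch(s_ai, s_e))
```

Positions where the two sides differ (mismatches and gaps) are the candidates for the unlikelihood and likelihood treatment described above.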

Ethics Statement
Methods based on unlikelihood training depend heavily on the quality of the negative candidates. In this paper, we propose a very general framework for providing negative candidates: computing the sequence alignment between S_AI and Human Edits or Imitation Edits. There are some potential problems in actual deployment. First, for Human Edits, we do not know whether the user edited because of an error in S_AI or because of personal preference. These two behaviors need to be distinguished in future research or actual deployment, because the former data is more suitable for fixing problems of the model itself (such as factual errors), while the latter is more suitable as user-personalized training data. Second, for both Human Edits and Imitation Edits, when a large number of complex edits appear, the sequence alignment algorithm we currently use may fail to produce the correct negative candidates, resulting in rewards or penalties for the wrong tokens.
In the experiments in this paper, we use several filters to control the quality of the training data provided for unlikelihood training, but real deployments will be far more complicated. Besides filters similar to ours, another solution is to directly track users' changes as they edit the summary in the product, leaving the subsequent training steps unchanged. However, this adds considerable engineering overhead to the product.
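As a sketch of what directly tracking user changes could look like, the following uses Python's standard difflib to recover edit operations between an AI summary and its edited version. This is an illustrative stand-in; a product-side implementation would capture edits in the editor itself:

```python
import difflib

def track_edits(s_ai, s_e):
    """Return the edit operations that turn the AI summary into the
    user-edited summary, as (op, removed_tokens, inserted_tokens)."""
    a, b = s_ai.split(), s_e.split()
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    # Keep only the opcodes that represent an actual change.
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

edits = track_edits("patient denies chest pain",
                    "patient reports mild chest pain")
print(edits)  # one "replace" operation covering the changed span
```

Edits captured this way could feed the same alignment-based training pipeline without re-deriving the diff afterwards.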

A Appendix
A.1 SOAP Structure
The SOAP (Subjective, Objective, Assessment, and Plan) structure is commonly used by providers (Podder et al., 2021).
* The Chief Complaint section is a brief description of a patient's conditions and the reasons for the visit.
* The Subjective section is a detailed report of the patient's current conditions, such as the source, onset, and duration of symptoms, mainly based on the patient's self-report. This section usually includes a history of present illness and symptoms, current medications, and allergies.
* The Objective section documents physical exam findings, laboratory data, vital signs, and descriptions of imaging results.
* The Assessment section typically contains medical diagnoses and the reasons leading to them, usually based on the content of the chief complaint and the subjective and objective sections.
* The Plan section addresses treatment plans based on the assessment.

A.2 Implementation Details
Due to data privacy issues, we cannot disclose our CC and CCUser datasets. However, to support reproduction of our methods, the Appendix also reports imitation-edit experiments on two general-domain summarization datasets: CNN/Daily Mail (CNN) (See et al., 2017) and Extreme Summarization (XSum) (Narayan et al., 2018).
The summarization models used in this paper are based on the publicly available T5-small and T5-large models. Note that the experiments with our T5-large-based model do not use real human-edit feedback on its own generated summaries: because of deployment and privacy issues, we could only collect the CCUser data (Human Edits) for summaries generated by the T5-small-based model via our mobile app. Therefore, we report T5-large-related results only in the appendix. All results in Section 6.1 are for our T5-small-based model. Overall, the patterns and findings are consistent across T5-small and T5-large.
In the imitation-edit evaluation results (CC_test-r, CNN_test-r, XSum_test-r) of Table 6, SALT_l+u performs better than the baseline SALT_l in all three experiments, consistent with our results using human-edit data in Table 3. In the original-data evaluation results (CC_eval, CNN_eval) of Table 6, although there was no forgetting problem in the first two experiments, SALT_l+u still scores higher than the baseline SALT_l. In the third experiment, we successfully simulated a forgetting problem similar to that of CC and CCUser by exploiting the distribution difference between CNN and XSum. Similar to the results in Table 3, SALT_l+u can alleviate the forgetting problem to a certain extent while improving performance on the new dataset.

A.4 More Discussion
Why does SALT work? First, SALT makes good use of the S_AI data. From a data augmentation perspective, S_E provides a new ground-truth summary from the user, and the user also implicitly verifies the remaining tokens in S_AI. SALT helps the model use all the tokens in both S_E and S_AI, which greatly improves the utilization of human-edit data. Second, SALT gives the model more objectives. Using S_AI in SALT makes the model not just "be close to the correct distribution", as in likelihood training, but also "be far away from a negative distribution". Thus, we can teach the model to avoid repeating the same mistakes, which has special meaning for the user (Assumption 1).
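The two objectives can be sketched at the token level as follows. This is an illustrative simplification (plain Python over per-token probabilities), not our exact implementation, which operates on model logits during training:

```python
import math

def salt_loss(token_probs, is_negative, w_like=1.0, w_unlike=1.0):
    """Token-level sketch of a combined likelihood/unlikelihood objective.
    token_probs[t]: model probability of the reference token at position t.
    is_negative[t]: True if the alignment marks the token as changed."""
    loss = 0.0
    for p, neg in zip(token_probs, is_negative):
        if neg:
            # Unlikelihood term: push probability of negative tokens DOWN,
            # i.e. penalize -log(1 - p).
            loss += -w_unlike * math.log(max(1.0 - p, 1e-9))
        else:
            # Likelihood term: push probability of verified tokens UP.
            loss += -w_like * math.log(max(p, 1e-9))
    return loss / len(token_probs)
```

Under this sketch, a model that confidently reproduces a token the user deleted (high p on a negative position) incurs a large penalty, while confidence on verified tokens is rewarded, which is exactly the "close to correct, far from negative" behavior described above.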
Human Edits and Imitation Edits Even though SALT can be used with human-edit data or imitation-edit data to improve summarization models, our experiments are not sufficient to conclude that Imitation Edits can completely replace Human Edits. Using Imitation Edits is essentially a kind of data augmentation during training. But when we have edits to our model's original output from real users, we have a unique opportunity to improve model output according to their individual expectations. SALT can model such information during training and help the model behave more appropriately to serve users better, in a more data-efficient way.
SALT and RLHF We discuss SALT and DPO in Section 7. Regarding the relationship between SALT and other RLHF methods, we offer some preliminary discussion here; it needs follow-up work to demonstrate. SALT appears to keep most of the advantages and disadvantages of DPO relative to PPO. Avoiding reinforcement learning generally means more stable and easier training (and hyperparameter tuning). Our human evaluation also shows that SALT can make models more aligned with human preference without an explicit reward model, as DPO does. It is also questionable whether a good explicit reward model can be learned from S_AI and S_E, since they are not as easy to distinguish as positive and negative movie reviews. As for limitations: how does the SALT model generalize out of distribution, compared with PPO with an explicit reward function? For example, standard RLHF methods can leverage additional unlabeled prompts by labeling LM generations with the learned reward model. Can training with self-labeling from SALT similarly make effective use of unlabeled prompts? Other papers, such as RAFT and RRHF, use explicit reward models to filter high-scoring data points for SFT; whether we can train a reward model good enough to serve as such a filter is also an open question here. Another difference is that we make full use of all data points (S_AI + S_E) during training, whereas they use only the high-quality ones (S_E) and discard the rest (S_AI). So, theoretically, we use data more efficiently and model more information from S_AI.
SALT can increase or decrease the model's emphasis on 1_E-C through different weights on the loss function. Increasing the loss weight of 1_E-C makes the model generate more words/concepts belonging to 1_E-C (G_w2 and G_c2), which follows our SAGE Assumption 2, while reducing the loss weight of 1_E-C makes the model generate fewer words and concepts belonging to 1_E-C (G_w2 and G_c2) and, at the same time, fewer words/concepts belonging to 1_AI-C (G_w1 and G_c1), which satisfies our SAGE Assumption 1. So SALT_l^d and SALT_l^i make the model better for users according to the SAGE metric. Second, unlike the three SALT variations above, SALT_u only uses S_AI, but it knows which tokens in S_AI belong to 1_AI-C and 1_AI-NC respectively. So SALT_u significantly reduces the words and concepts belonging to 1_AI-C. However, because the data of 1_E-NC has not been seen, SALT_u rarely generates related words and concepts. Finally, SALT_l+u has the most granular information, knowing which tokens in S_AI (S_E) belong to 1_AI-C, 1_AI-NC, 1_E-C, and 1_E-NC, through their corresponding loss weights. Therefore, SALT_l+u can learn a more suitable distribution, decreasing the generation of words and concepts belonging to 1_AI-C while increasing the generation of words and concepts belonging to 1_AI-NC, 1_E-C, and 1_E-NC.
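A minimal sketch of how tokens could be assigned to these four groups from a token alignment, under the simplifying assumption that any non-matching aligned position counts as "changed":

```python
def label_salt_groups(aligned):
    """Split tokens from an (s_ai, s_e) alignment into the four SALT
    groups; '-' marks a gap.  Illustrative sketch, not our exact rules."""
    ai_c, ai_nc, e_c, e_nc = [], [], [], []
    for tok_ai, tok_e in aligned:
        if tok_ai == tok_e:
            # Unchanged position: implicitly verified by the user.
            ai_nc.append(tok_ai)   # 1_AI-NC
            e_nc.append(tok_e)     # 1_E-NC
        else:
            if tok_ai != "-":
                ai_c.append(tok_ai)  # 1_AI-C: changed, gets unlikelihood loss
            if tok_e != "-":
                e_c.append(tok_e)    # 1_E-C: changed, gets likelihood loss
    return {"1_AI-C": ai_c, "1_AI-NC": ai_nc, "1_E-C": e_c, "1_E-NC": e_nc}
```

Each group's tokens can then be assigned its own loss weight, which is the knob the paragraph above manipulates.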

Figure 1 :
Figure 1: CCUser & CC GPT-4 preference. We instructed GPT-4 to rank 4 AI-generated summaries (on 500 data points): M (not trained on CCUser), SALT_l, SALT_l+u, and SALT_l+u+RSALT_l+u. (1) SALT_l+u is most preferred by GPT-4 on CCUser, (2) while SALT_l+u+RSALT_l+u is most preferred by GPT-4 on CC. (3) Human preference on CC (on 25 data points) for M, SALT_l, SALT_l+u, and SALT_l+u+RSALT_l+u.

Table 1 :
Example of conversation-to-notes summarization data from the Clinician Conversations (CC) dataset and the corresponding human-edit dataset, CCUser, where user-edited summaries S_E are made from the AI-generated summaries S_AI produced by our SOAP generation pipeline.

Table 2 :
Average words in CC and CCUser.
We call this metric System output Against the Generated and Edited sentence (SAGE). Given the evaluation data (U, S_AI, S_E), where S_AI is generated by the model trained on the original summarization dataset (e.g., CC) and S_E is edited by a human based on (U, S_AI), we can obtain the new summary S_new generated by the new model trained on the Human Edits dataset (e.g., CCUser). Using (S_new, S_AI, S_E), we define three groups of words after removing stop words and punctuation from S_new:
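The precise group definitions are given in the main text; as a loudly hedged illustration, a SAGE-style grouping by set membership (words of S_new unique to S_AI, unique to S_E, or shared by both) might be sketched as follows. The tiny stop-word list is a placeholder assumption:

```python
def sage_word_groups(s_new, s_ai, s_e,
                     stopwords=frozenset({"the", "a", "of", "and"})):
    """Illustrative SAGE-style word grouping; the paper's exact group
    definitions and stop-word handling may differ."""
    def content(text):
        # Lowercased content words, with stop words removed.
        return {w.lower() for w in text.split()} - stopwords
    new, ai, e = content(s_new), content(s_ai), content(s_e)
    return {
        "G_w1": new & (ai - e),  # generated words unique to the AI draft
        "G_w2": new & (e - ai),  # generated words unique to the human edit
        "G_w3": new & ai & e,    # generated words verified by both
    }
```

Under this reading, a lower G_w1 and a higher G_w2 would indicate the new model avoids rejected content while adopting the user's additions.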

Table 3 :
Human Edits results. Compared to the likelihood training SALT_l, our proposed SALT_l+u performs better on both the new human-edit CCUser_eval and the model's prior training CC_eval dataset, when training only on CCUser (Section 6.1.1). Further, we show that the catastrophic forgetting problem can be addressed with a replay-based augmentation of our method, RSALT (Section 6.3). The compared methods are:
1. M: the model converged on the CC dataset. All methods below are initialized from M and continue training on S_E, S_I, and S_AI.
2. SALT_l: the baseline, based only on likelihood training on S_E or S_I.
3. SALT_l^d (or SALT_l^i): likelihood training on S_E or S_I, but with decreased (or increased) weights for 1_E-C or 1_I-C tokens.
4. SALT_u: only unlikelihood training on S_AI.
5. SALT_l+u: both likelihood (on S_E or S_I) and unlikelihood (on S_AI) training.
6. SALT_x: all the above SALT variations.
7. SALT_x+RSALT_l: the traditional replay-based method. When continuing to train M with different SALT variations on new data, this method samples part of the data M has already seen and trains on it with likelihood loss.
8. SALT_x+RSALT_l+u: following Section 4.3, RSALT treats the data sampled by the replay-based method as imitation-edit data and uses both likelihood and unlikelihood training.
In Table 3, the evaluation on CCUser_eval shows that, compared to the regular likelihood training SALT_l, SALT_l+u performs better.

Table 4 :
Word-level and concept-level SAGE for CCUser_eval, normalized by SALT_l as the baseline. Changing the loss weights for 1_E-C tokens in likelihood training (SALT_l^d or SALT_l^i) changes their performance. Predictably, we see in Table 4 that SALT_l^i produces higher G_w2 than SALT_l^d, while the trends in the other columns are less pronounced since S_AI is not considered. Similarly, SALT_u produces lower G_w1 than the others. However, SALT_l+u achieves significantly higher performance on both CC and CCUser. We further show in Table 4 how we can manipulate a model's behaviors with different SALT variations through SAGE. First, SALT_l only uses S_E, and all tokens in S_E contribute to the loss equally. SALT can increase or decrease the model's emphasis on 1_E-C through different loss weights.

Table 5 :
SALT results for imitation-edit experiments. The imitation-edit data comes from the training dataset that model M has already seen, assuming the ground truth was generated by editing the model's output.

Table 6 :
Imitation Edits experiments.Here the imitationedit data comes from a subset of the corresponding test dataset (we don't use them in the table for metrics), which M has never seen before.We use CC-test for SALT and CC-train for RSALT during training.

Table 11 :
Imitation Edits experiments. Here the imitation-edit data comes from a subset of the corresponding test dataset (which we do not use for the metrics in the table), which M has never seen before. () and <> show the training data used by SALT and RSALT, respectively.