AutoTrial: Prompting Language Models for Clinical Trial Design

Clinical trials are critical for drug development. Constructing the appropriate eligibility criteria (i.e., the inclusion/exclusion criteria for patient recruitment) is essential for a trial's success. Proper design of clinical trial protocols should consider similar precedent trials and their eligibility criteria to ensure sufficient patient coverage. In this paper, we present a method named AutoTrial to aid the design of clinical eligibility criteria using language models. It allows (1) controllable generation under instructions via a hybrid of discrete and neural prompting, (2) scalable knowledge incorporation via in-context learning, and (3) explicit reasoning chains that provide rationales for understanding the outputs. Experiments on over 70K clinical trials verify that AutoTrial generates high-quality criteria texts that are fluent and coherent and that accurately capture the clinical concepts relevant to the target trial. Notably, our method, with a much smaller parameter size, achieves around a 60% win rate against GPT-3.5 baselines in human evaluations.


Introduction
Generative large language models (LLMs) are drawing attention due to their ability to create coherent and human-like text documents. Clinical trial design documents are written at the planning stage of the drug development process and are crucial for the success of the trial. However, writing them can be challenging even for experienced professionals: around 57% of trial protocols have at least one substantial amendment in eligibility criteria (CSDD, 2016). A suboptimal trial design may cause insufficient recruitment, severe adverse events, or insignificant efficacy, inducing huge financial losses and wasted time. Each amendment further causes millions of dollars in losses and months of delays.
In this paper, we propose to generate the eligibility criteria for clinical trials in natural language using LLMs, with the solution focusing on the following aspects.
• Comprehending instructions. The LLM will be prompted with key information about a trial, such as the target conditions and treatments, and an additional instruction to generate the criteria for participant recruitment. This process requires the employed LLM to comprehend the input and adapt to the input instruction to generate precise eligibility criteria that meet the specified objectives. It also requires domain knowledge about clinical trials to be stored in the LLM.
• Referring to prior studies. A thorough literature search is important for human experts designing clinical trials (Chew, 2019). Similarly, the employed LLMs should be able to leverage context information, such as eligibility criteria retrieved from prior successful studies, as references to generate better trial designs.
• Rationalizing the generation. LLMs should offer the rationale behind the generated criteria, which is important for clinical experts to understand and adopt the generated results in practice.
In this paper, we propose to augment clinical trial design using LLMs, motivated by the evidence that LLMs can act as implicit knowledge bases (Petroni et al., 2019; Taylor et al., 2022). Our method is equipped with instruction tuning for trial protocol design and explicit supervision for producing grounded rationales. This is enabled by the following technical features:
• Instruction prompting for adapting to expert instructions. Instruction tuning provides granular control over the generated criteria to follow diverse user intentions.
• Scalable and efficient knowledge expansion. A combination of (1) external memory for a dense retriever and (2) internal memory for neural prompting, which is amenable to updating the model incrementally as new data become available.
• Explicit supervision for generating grounded rationales. Adaptation of LLMs with a reasoning capability through a supervised paradigm, making the generated criteria more transparent and interpretable.
Our AutoTrial is the first model that utilizes LLMs for automating clinical trial design. It represents an important step toward using AI to facilitate clinical trial design. The rest of the paper is organized as follows: in §2, we review related work; in §3, we describe the proposed method in detail; in §4, we present the experiment results. Notably, AutoTrial is shown to generate accurate criteria (with precision 0.91, recall 0.92, F1 0.91, and a Jaccard score of 0.84 in the clinical accuracy evaluation), while almost all baselines score below 0.5 on these metrics. Moreover, our method reaches around a 60% win rate against GPT-3.5 in trial design tasks via human evaluation. Finally, in §5, we conclude and discuss future work.

Large Language Models
Large language models pre-trained on web-scale text data exhibit extraordinary emergent capabilities on a diverse set of natural language processing (NLP) tasks (Kaplan et al., 2020; Brown et al., 2020). It has recently been shown that LLMs can be further tuned to align with human preferences through instruction tuning (Chung et al., 2022; Wang et al., 2022) and reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Yuan et al., 2023; Dong et al., 2023).
Despite the remarkable capabilities of LLMs trained on general text corpora, they often face challenges on highly domain-specific generation tasks unless they undergo additional tuning. Research has shown that even a "small" 300M-parameter LM, with instruction tuning, can outperform LLMs with over 100B parameters (Yasunaga et al., 2022). This finding encourages efforts to develop customized expert LLMs by performing instruction tuning on domain-specific datasets, e.g., clinical notes and scientific publications (Singhal et al., 2022). In this work, we are the first to develop LLMs focusing on trial design through a mixture of techniques, including instruction tuning, evidence-grounded generation, and supervised learning for rationale generation. (Throughout, GPT-3.5 refers to engine gpt-3.5-turbo-0301: https://platform.openai.com/docs/models/gpt-3-5.)

Clinical Trial Design
Clinical trial design is a new research topic for the NLP community, and only a few works relate to it, focusing either on trial feature embedding or on trial design evaluation. For trial feature embedding, Marshall et al. (2017) extracted text pieces that describe the key trial characteristics as a summary report. More recently, Wang and Sun (2022) developed a self-supervised document model for dense retrieval of clinical trials. For trial design evaluation, Kim et al. (2021) manually adjusted criteria to broaden patient accrual and assess the influence of criteria, while Liu et al. (2021) utilized Shapley scores (Lundberg and Lee, 2017) to estimate the change in the hazard ratio of the included oncology patients when removing each criterion. Despite these efforts, no existing work focuses on clinical trial design automation.

Method
AutoTrial utilizes a decoder-based architecture to generate a target criterion based on an input trial synopsis and manual instructions. The training process consists of two stages: pretraining and finetuning.
• During the pretraining stage, the model is trained on a large corpus of trial documents in order to learn to reason through multiple steps and mimic the retrieved input criteria exemplars.
• In the finetuning stage, the model is trained to generate the target criterion according to the input instructions. For example, an instruction <age> urges the model to produce the criterion describing the participants' age requirement.
It is noteworthy that the model can be extended to new instructions and trial exemplars without retraining. The overall workflow is depicted in Fig. 1. We elaborate on the details of the training and inference procedures of AutoTrial next.

Problem Setup
The generation model is represented by the function f and generates a target criterion y_c based on the input x = {x_s, x_e, x_r}. Here, x_s denotes the trial setup, which is a concatenation of the trial title, condition, and treatment, as illustrated in Fig. 1. x_r denotes the discrete prompt describing the objective criterion, e.g., "bmi" prompts the model to generate the criterion for the participants' body mass index. x_e denotes the exemplars retrieved from relevant trials, built for the in-context learning of LLMs. To this end, we formulate x_e = {x_e^t, x_e^r, x_e^c}, which contains the reasoning steps x_e^t, e.g., a chain of criteria that leads to the target criterion; the targeting instruction x_e^r, e.g., the objective to be set for recruiting patients; and the target criterion x_e^c that describes the requirement based on the instruction.
The model is also controlled by a continuous prompt h_p, which is specific to each type of instruction, e.g., the target entity that the criterion should contain. The model is trained to generate criteria y with multi-step reasoning: generating relevant criteria one by one and ultimately yielding the target criterion. The generation process is therefore expressed in Eq. (1),

y = f(x_s, x_e, x_r, h_p). (1)

Referring to the exemplar x_e, the model outputs y = y_t ⊕ y_c, where y_t comprises the reasoning steps and y_c denotes the target criterion.

Hybrid Prompting
We opt to employ a hybrid of discrete and neural prompting to endow the model with the ability to generate criteria based on specific instructions.

Discrete Prompt
The discrete prompt is motivated by the prospect of in-context learning in LLMs (Wei et al., 2022), as the reasoning ability of LLMs can be enhanced via input-output exemplars, e.g., the concatenation of a series of criteria x_e^t, the target instruction x_e^r, and the target criterion x_e^c. We formulate the discrete prompts with specialized tokens:
1. Trial Setup: we wrap the introduction of the trial setup, including title, disease, and treatment, using specialized tokens like <title>, <disease>, and <treatment>. The setup offers the basic context of a trial.

2. In-context Exemplar: we curate the exemplar to resemble a multi-step reasoning procedure: the model first generates a series of intermediate rationales that lead to the final answer. Concretely, the exemplar x_e = {x_e^t, x_e^r, x_e^c} is retrieved from the external knowledge store and serves as the template for the model outputs. x_e^t is a series of eligibility criteria wrapped by <inc> or <exc>, indicating inclusion or exclusion criteria. x_e^r describes the instruction wrapped by <statement>, e.g., telling the model to generate a criterion describing the age requirement via "<statement> age". x_e^c is the target criterion wrapped by <target>, e.g., "<target> age is above 18 yrs old".
3. Textual Instruction: following the exemplar, x_r enforces the model to obey the instruction, wrapped by <statement>, such as "<statement> gender".
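As an illustration, the three components above might be concatenated as follows. This is a minimal sketch: the special token names come from the paper, but the exact formatting, field names, and whitespace are our assumptions.

```python
def build_prompt(setup, exemplar, instruction):
    """Assemble the discrete prompt: trial setup, in-context exemplar,
    and textual instruction, each wrapped in its specialized tokens.
    The concatenation order and whitespace are assumptions."""
    setup_part = (f"<title> {setup['title']} "
                  f"<disease> {setup['disease']} "
                  f"<treatment> {setup['treatment']}")
    # Exemplar: a chain of criteria, then its instruction and target criterion.
    steps = " ".join(f"<{kind}> {text}" for kind, text in exemplar["steps"])
    exemplar_part = (f"{steps} <statement> {exemplar['instruction']} "
                     f"<target> {exemplar['target']}")
    # The instruction for the criterion to be generated now.
    return f"{setup_part} {exemplar_part} <statement> {instruction}"

prompt = build_prompt(
    {"title": "A Study of Metformin", "disease": "type 2 diabetes",
     "treatment": "metformin"},
    {"steps": [("inc", "age >= 18"), ("exc", "pregnant")],
     "instruction": "age", "target": "age is above 18 yrs old"},
    "bmi",
)
```

The resulting string would be fed to the language model, which is expected to continue it with the criterion matching the final "<statement> bmi" instruction.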
The exemplars are stored in an external knowledge store that provides an open-book reference the model can consult during generation. It is built on the training data C = {(x_i, y_i)}_{i=1}^N, which can be edited, extended, or pruned over the course of training and generation without retraining the model. We utilize a neural encoder T named Trial2Vec (Wang and Sun, 2022) that encodes trial setups x_s into dense embeddings h_s = T(x_s) ∈ R^d that carry rich trial semantics. Consequently, the knowledge store is given by Eq. (2),

M = {(h_s^i, x_e^i)}_{i=1}^N, (2)

with the embeddings serving as the keys and the exemplars as the values. A vector-based search engine can then be used for efficient exemplar retrieval on the fly.
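The behavior of the knowledge store can be sketched as below. A toy letter-count encoder stands in for Trial2Vec, and brute-force cosine search stands in for a vector search engine; the class and method names are ours, not the paper's.

```python
import math
import string

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def toy_encode(text):
    """Stand-in for Trial2Vec: a letter-count embedding."""
    return [text.count(c) for c in string.ascii_lowercase]

class KnowledgeStore:
    """Embedding-keyed exemplar store: entries can be added or removed
    without touching the generator, enabling incremental updates."""
    def __init__(self, encode):
        self.encode = encode
        self.entries = []  # list of (key embedding h_s, exemplar x_e)

    def add(self, trial_setup, exemplar):
        self.entries.append((self.encode(trial_setup), exemplar))

    def retrieve(self, trial_setup, k=1):
        # Brute-force nearest neighbors; a vector search engine in practice.
        q = self.encode(trial_setup)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]),
                        reverse=True)
        return [ex for _, ex in ranked[:k]]

store = KnowledgeStore(toy_encode)
store.add("diabetes", "exemplar criteria from a diabetes trial")
store.add("cancer", "exemplar criteria from a cancer trial")
```

Because the store is just a keyed collection, adding exemplars from newly published trials requires no model retraining, which is the point of keeping this memory external.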

Neural Prompt
Consider the embedded input tokens H ∈ R^{L×d}. Formally, we create a set of instruction indices I; the i-th instruction x_{r,i} is parameterized by Eq. (3),

h_p = MLP(E_r[i, :]), (3)

where E_r ∈ R^{|I|×d′} is the trainable embedding matrix, E_r[i, :] indicates looking up the i-th row of the matrix, and MLP : R^{d′} → R^d projects the embedded instruction to the same dimension as H.
The neural prompting is modular: additional instructions I′ can be incorporated by simply extending the index set I = {I, I′} and the embedding matrix E_r = {E_r, E′_r}. When the model is finetuned on new data, we can update only the new instruction embeddings E′_r while the rest of the model remains frozen. This allows the model to effectively learn to generate based on a broader range of instructions while minimizing the risk of catastrophic forgetting, i.e., performance degradation on previous data.
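The extensible instruction table can be sketched as follows. This is a dependency-free illustration of the bookkeeping only; a real implementation would use a trainable torch.nn.Embedding plus an MLP, and all names here are ours.

```python
import random

class NeuralPrompt:
    """Per-instruction prompt table mirroring E_r: new instruction rows
    can be appended and trained while earlier rows stay frozen."""
    def __init__(self, d_prompt, seed=0):
        self._rng = random.Random(seed)
        self.d_prompt = d_prompt
        self.rows = {}       # instruction index i -> row E_r[i, :]
        self.frozen = set()  # rows excluded from gradient updates

    def add_instruction(self, idx):
        # Small random initialization for a new instruction embedding.
        self.rows[idx] = [self._rng.gauss(0.0, 0.02)
                          for _ in range(self.d_prompt)]

    def freeze_existing(self):
        # Called before finetuning on new data: only rows added afterwards
        # (the E'_r block) remain trainable.
        self.frozen = set(self.rows)

    def trainable(self):
        return set(self.rows) - self.frozen

    def lookup(self, idx, mlp):
        # h_p = MLP(E_r[i, :]): project the row to the model dimension d.
        return mlp(self.rows[idx])

prompts = NeuralPrompt(d_prompt=4)
for i in range(3):          # original instruction set I
    prompts.add_instruction(i)
prompts.freeze_existing()
prompts.add_instruction(3)  # new instruction I', appended without retraining
```

After `freeze_existing()`, only the newly added row would receive gradient updates, which mirrors how the paper limits catastrophic forgetting.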

Multi-stage Training
As described in §3.1, we have a dataset containing pairs of input instructions (denoted x_r) and corresponding criteria (denoted y_c). We extract clinical relations from the raw criteria to formulate the training and testing data, e.g., extracting the relation "NYHA ∈ {III, IV}" from the criterion "NYHA class is above II". However, the parser may not be able to extract all relevant instructions from all available trial documents. We hence propose to train our method in two stages: first pretraining on a large set of unlabeled trial documents and then finetuning on the processed dataset of instruction-criteria pairs. This approach allows us to make the most of the available data and improve model performance.
Pretraining. We create a pretraining dataset where the model f is urged to generate y = y_t ⊕ y_c in Eq. (1). The inputs comprise the trial setup x_s and the exemplar x_e, which is also composed of multiple criteria. Drawing inspiration from Taylor et al. (2022), we include prompts and special tokens in the pretraining stage. Specifically, we explicitly emphasize the step-by-step reasoning task by inserting the special tokens <inc> and <exc> into x_e and y_t, and the model is supervised to generate the intermediate rationales and yield the target criterion.
Our method is built on a decoder-based causal language model (e.g., GPT-2 (Radford et al., 2019)), where the decoder predicts y autoregressively. Denoting the learned decoding distribution by p_θ(·), the objective is maximum log-likelihood estimation, given by Eq. (5),

L_MLE = − Σ_{l=1}^{L} log p_θ(y_l | y_{<l}, x), (5)

where y_{<l} are the tokens in y before the l-th token and L is the total number of tokens in the target y.
Finetuning. After pretraining, the model is finetuned on the dataset C and taught to follow the instruction when generating criteria. The inputs and outputs are described in Eq. (1). In addition to the MLE loss in Eq. (5), we apply a contrastive loss L_CL (Su et al., 2022) to enhance the model's representation learning, as in Eq. (6),

L_CL = (1 / (L(L−1))) Σ_{l} Σ_{j≠l} max{0, ρ − s(h_{y_l}, h_{y_l}) + s(h_{y_l}, h_{y_j})}, (6)

where h_{y_l} is the embedding of token y_l, ρ is a predefined margin, and s(·,·) is the cosine similarity function. The finetuning loss combines the objectives from Eqs. (5) and (6), as given by Eq. (7),

L = L_MLE + L_CL. (7)

Note that the decoding distribution in the finetuning stage is p_θ(y_l | y_{<l}, x, h_p), which differs from the one used in pretraining, shown in Eq. (5).
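A toy rendering of the two training objectives is sketched below. Token probabilities and embeddings are passed in directly rather than produced by a model, so this only illustrates the arithmetic of the losses, not the training loop.

```python
import math

def mle_loss(token_probs):
    """Negative log-likelihood of the target tokens, where each entry
    is p_theta(y_l | y_<l, x) for one position (Eq. (5))."""
    return -sum(math.log(p) for p in token_probs)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(token_embs, rho=0.5):
    """SimCTG-style margin loss (Eq. (6)): push apart representations of
    different tokens in the same sequence. Note s(h_l, h_l) = 1 for
    cosine similarity, so each term is max(0, rho - 1 + s(h_l, h_j))."""
    L = len(token_embs)
    total = 0.0
    for l in range(L):
        for j in range(L):
            if j != l:
                total += max(0.0, rho - 1.0
                             + cosine(token_embs[l], token_embs[j]))
    return total / (L * (L - 1))

def finetune_loss(token_probs, token_embs, rho=0.5):
    """Combined objective L = L_MLE + L_CL (Eq. (7))."""
    return mle_loss(token_probs) + contrastive_loss(token_embs, rho)
```

Orthogonal token embeddings incur no contrastive penalty, while identical embeddings are penalized by the margin ρ, which matches the intent of discouraging degenerate, repetitive representations.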

Generation
Denoting the vocabulary by V, we conduct top-k sampling repeatedly to acquire diverse candidates Ŷ = {ŷ_q}_{q=1}^{Q}, by Eq. (8),

ŷ_l ∼ p_θ(y | y_{<l}, x, h_p), y ∈ V^(k_s), (8)

where V^(k_s) is the subset of V that maximizes Σ_{y ∈ V^(k_s)} p_θ(y | y_{<l}, x, h_p) and |V^(k_s)| = k_s. We further adopt clustering and ranking to select samples from the generated candidates. We first encode each ŷ with Trial2Vec into dense embeddings h_ŷ and apply k-means clustering with k_q clusters. We then compute the perplexity (ppl) of each output ŷ_q and pick the sample with the minimum ppl in each cluster to form the final candidate set of k_q samples. An example of the input and output of AutoTrial can be found in Table 5 in the appendix.
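The sample-then-select procedure can be sketched as follows. A toy clustering key stands in for Trial2Vec embeddings plus k-means, perplexity is computed from per-token probabilities, and all function names are ours.

```python
import math
import random

def top_k_sample(step_dist, k_s, rng):
    """Sample the next token from the renormalized top-k_s entries of the
    predicted distribution (the restriction in Eq. (8))."""
    top = sorted(step_dist.items(), key=lambda kv: kv[1], reverse=True)[:k_s]
    z = sum(p for _, p in top)
    r, acc = rng.random() * z, 0.0
    for tok, p in top:
        acc += p
        if r <= acc:
            return tok
    return top[-1][0]

def perplexity(token_probs):
    """ppl = exp(-(1/L) * sum_l log p_l)."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def select_candidates(candidates, cluster_of, ppl_of):
    """Keep the minimum-perplexity candidate per cluster. The paper
    clusters Trial2Vec embeddings with k-means; `cluster_of` is a
    stand-in assigning each candidate a cluster id."""
    best = {}
    for c in candidates:
        key = cluster_of(c)
        if key not in best or ppl_of(c) < ppl_of(best[key]):
            best[key] = c
    return sorted(best.values())

rng = random.Random(0)
token = top_k_sample({"a": 0.5, "b": 0.3, "c": 0.2}, k_s=2, rng=rng)
ppls = {"age > 18": 1.2, "age > 21": 1.1, "bmi < 30": 2.0}
kept = select_candidates(list(ppls), lambda c: c.split()[0], ppls.__getitem__)
```

Clustering before ranking keeps the final set diverse (one representative per cluster) while the perplexity ranking keeps each representative fluent.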

Experiment
We conduct extensive experiments to evaluate AutoTrial in the following aspects:
• The overall quality in terms of criteria-level and trial-level trial protocol generation.
• The capability of the continual update for new trials and instructions.
• The ablation analysis for the components of the proposed method.

Dataset
We collected clinical trial documents from ClinicalTrials.gov (NIH, 2023) and filtered out those without valid interventions, diseases, or titles, as well as those with void eligibility criteria. We extracted 75,977 clinical trials and applied Clinical Trial Parser (FAIR, 2022) to extract 751,258 medical relations from the eligibility criteria of these trials. The train-test split is shown in Table 1. For each trial, we sampled one criterion as the target and several others as input exemplars, resulting in 2,528,231 unique training samples out of 400K trials as the pretraining data. The validation and test trials were excluded from the pretraining data.

Evaluation Strategy
Automatic Evaluation. To evaluate the quality of the output criteria, which are expressed in natural language, we employ metrics from the NLG literature, including CIDEr (Vedantam et al., 2015), ROUGE-L (Lin, 2004), METEOR (Lavie and Agarwal, 2007), and BLEU (Papineni et al., 2002). These metrics allow us to quantitatively assess the fluency and coherence of the generated criteria. We evaluate all methods at the criteria level and the trial level. At the criteria level, the model generates each criterion separately, using the concatenated trial setup texts and the first three tokens of the target criterion as input. At the trial level, the model takes the concatenated trial setup texts as input and generates all of the criteria for the trial at once.
Clinical Accuracy. To evaluate the clinical accuracy of the generated criteria, we run Clinical Trial Parser (FAIR, 2022) on the generated criteria to extract the medical relations and compare them with the relations extracted from the corresponding ground-truth criteria. We evaluate the overlap of the two relation sets by precision, recall, F1-score, and Jaccard similarity.
Human Evaluation. We perform a manual evaluation to compare the clinical trial designs generated by our method with those generated by a general LLM. We enlisted domain experts to assess and choose the superior output between our method's output and the LLM's output for a given trial synopsis. This allowed us to collect feedback and calculate the win rate.
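The clinical accuracy scoring reduces to set arithmetic over the extracted relations, as sketched below (the relation strings are illustrative, not actual parser output):

```python
def relation_overlap(pred_relations, gold_relations):
    """Precision, recall, F1, and Jaccard similarity between the relation
    sets extracted from generated vs. ground-truth criteria."""
    pred, gold = set(pred_relations), set(gold_relations)
    tp = len(pred & gold)  # relations present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    jaccard = tp / len(pred | gold) if pred | gold else 0.0
    return precision, recall, f1, jaccard

p, r, f1, jac = relation_overlap(
    ["age >= 18", "bmi < 30"],     # relations from generated criteria
    ["age >= 18", "hba1c > 6.5"],  # relations from ground-truth criteria
)
```

High precision with low recall would indicate few hallucinated relations but poor coverage, which is the pattern the paper reports for most baselines.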
For our method, we leverage GPT-2 (Radford et al., 2019) as the backbone model. We set the maximum context length to 1,024. In the pretraining stage, we train the backbone model with a batch size of 64, a learning rate of 5e-5, a weight decay of 1e-4, and 5 epochs. In the instruction tuning stage, we train the model with a batch size of 16, a learning rate of 5e-5, a weight decay of 1e-5, and 10 epochs.

Exp 1: Generation Quality
Text quality. Table 2 shows the automatic evaluation scores at the trial level. The results show that AutoTrial outperforms the baselines on all four metrics (BLEU-1, METEOR, ROUGE-L, and CIDEr) for both inclusion and exclusion criteria. We can draw similar conclusions from Table 3, which evaluates at the criteria level.
One notable finding is that performance on inclusion criteria generation is generally better than on exclusion criteria generation. We conjecture that because inclusion criteria precede exclusion criteria in the training data, the latter may be truncated due to the model's maximum acceptable length. Besides, errors may accumulate when generating criteria autoregressively. AutoTrial mitigates this order issue thanks to the hybrid prompting.
Clinical accuracy. We present the clinical accuracy evaluation results in Table 4. Consistent with the automatic evaluation results, AutoTrial performs better at criteria generation, with an even larger performance gap. For example, AutoTrial is the only method that yields recall above 0.5 (0.91), F1 above 0.6 (0.91), and a Jaccard score above 0.4 (0.84) in inclusion criteria generation. It wins over the baselines by a prominent margin in exclusion criteria generation. These results demonstrate that our method can generate criteria accurately aligned with the provided instructions. We also observe that most methods obtain decent precision, with AutoTrial performing best. This implies a low hallucination rate in our method's generated text, because most generated relations are also concretely mentioned in the ground-truth eligibility criteria. However, the baselines perform much worse in recall and Jaccard scores.
This indicates that AutoTrial achieves high coverage of the target clinical relations in the generated criteria.
Human Evaluation. The human evaluation results are shown in Fig. 2: our method significantly outperforms the GPT-3.5 baselines, i.e., in over 60% of cases, the output criteria from AutoTrial are judged better than those from GPT-3.5. This again highlights the opportunity to develop expert LLMs that surpass general LLMs at much lower cost. Interestingly, 5-shot GPT-3.5 is worse than the 1-shot and zero-shot variants. We conjecture that GPT-3.5 is distracted by context containing irrelevant exemplars when generating for the target trials.

Exp 2: In-depth Analysis
Performance Divergence. We divided the raw test trials by their target diseases, leading to 100 non-overlapping sets, each sharing the same target disease. We then evaluated the generated texts within each subset. We created box plots of the obtained scores (evaluated on the combination of inclusion and exclusion criteria) in Fig. 3.
Our results indicate that AutoTrial exhibits superior performance across all metrics. It achieves the highest median performance and a more stable score distribution, with both high upper and lower bounds on all metrics. Among the baseline methods, SimCTG performs best on three metrics, with the exception of METEOR. However, its worst-case performance is typically much lower than that of most other methods. We also zoom in on the performance of trials targeting the top eight most frequent diseases/conditions in Fig. 6, where AutoTrial consistently wins over all baselines.
Qualitative Analysis. We present several qualitative results of our model in Table 5. The model inputs have two parts, a manual input and an automatically built input, which are concatenated and passed to the model. Users set up the basic trial information and can opt to offer different instructions for generating criteria. As observed in the first four rows of Table 5, the outputs vary when provided with different instructions for the same trial. Furthermore, the generated outputs are fluent, coherent, and closely resemble the reference manually written criteria.

Exp 3: Incremental Learning
One major merit of AutoTrial is that it can continually update its internal and external memory without retraining on all collected data. To demonstrate this capability, we designed two variants of our method: Re-train and Incremental. These variants were trained on four subsets of the raw training set, {C_1, C_2, C_3, C_4}, with the instruction types equally assigned and mutually exclusive across subsets. The models encountered the subsets sequentially. The Re-train model learns from the union of all previously seen subsets, e.g., it is trained and evaluated on {C_1, C_2} when C_2 is revealed; in essence, Re-train is the theoretical upper bound for all incremental learning methods. In contrast, the Incremental model is also evaluated on {C_1, C_2} but is trained only on C_2 when it is revealed. Additionally, during training, the Incremental model updates only the neural prompting while freezing all other parameters.
We present the results in Fig. 4. The Incremental model demonstrates the capability of mitigating catastrophic forgetting when extended to new data. However, the gap between the two variants expands over time: the Incremental model decays to the level of the best baseline after being updated on the fourth subset, when the total number of instructions is 4× that of the first subset. We hence suggest incrementally updating AutoTrial until the number of new instructions reaches around 3× that of the last fully retrained checkpoint, trading off utility against cost.

Exp 4: Ablation Study
We conducted an ablation study (shown in Fig. 5) comparing the original version of AutoTrial with variants that remove certain components: the multi-step reasoning supervision (w/o MSR), the retrieval-augmented generation (w/o RAG), and the neural prompting (w/o Prompt). The results show that both RAG and Prompt have a significant impact on the final performance. Removing MSR performs similarly to the original version on inclusion criteria but yields worse results on exclusion criteria. MSR is ultimately retained in the final model, as it produces more balanced results between inclusion and exclusion criteria and also provides insight into the model's reasoning path, making it more interpretable.

Conclusion
In summary, this paper presents AutoTrial, which uses language models to aid the design of clinical trial protocols. Our method generates high-quality criteria texts that are fluent, coherent, and clinically accurate by combining controllable generation, scalable knowledge incorporation, and multi-step reasoning. This can potentially reduce the risk of clinical trial failure by ensuring that trials are properly designed and have sufficient power to evaluate proposed therapies.

Limitations
The proposed method, AutoTrial, is a valuable tool for designing clinical trials, providing controllable generation under instructions, scalable knowledge incorporation, and multi-step reasoning. However, one limitation of the method is its dependence on the quality of the data used to train the language model. If the clinical trial database used for training contains biases or inaccuracies, these limitations may carry over into the generated criteria texts. To ensure the quality of the generated criteria, it is crucial to train the language model on high-quality, accurate, and up-to-date data, which can be achieved by regularly updating the clinical trial databases used for training.
Additionally, the method may not account for unexpected or rare side effects or issues that may occur during a trial, which may impact the safety and efficacy of the proposed treatment. AutoTrial should be considered a supportive tool for designing clinical trials, and the final decision should always be made by human clinicians. The tool can aid in identifying relevant trials and generating high-quality criteria texts, but it is ultimately the clinician's responsibility to evaluate the overall design and safety of the trial, taking into account the unique characteristics and needs of the trial population. The tool should aid the design process, not replace the expertise and judgment of human clinicians.

Figure 1 :
Figure 1: The workflow of the proposed AutoTrial. Step I: pre-train on unlabeled trial documents with prompts to mimic the multi-step reasoning. Step II: finetune the model to generate criteria under instructions. Step III: generate diverse target criteria by instructions with large-scale sampling plus clustering and ranking.

Figure 3 :
Figure 3: In-depth analysis of generation quality across 100 trial groups divided by the target disease. Baselines are based on GPT-2.

Figure 4 :
Figure 4: Comparison between two variants of AutoTrial: Re-train and Incremental. The Re-train variant is trained on all subsets, while the Incremental variant updates its knowledge only on new subsets. The scatter plot also includes the performance of four baselines, which are trained on all data.

Figure 5 :
Figure 5: Ablation experiments of AutoTrial when removing one module. AT: the original version; w/o MSR: without the multi-step reasoning supervision; w/o RAG: without the retrieval-augmented generation; w/o Prompt: without the neural prompting.

Figure 6 :
Figure 6: The generation quality across the trials targeting the top-8 most frequent diseases/conditions.

Table 1 :
The statistics of the used trial data.

Table 2 :
Automatic evaluation of eligibility criteria generation on the test set at the trial level, i.e., comparing the concatenated inclusion/exclusion criteria of a trial with the corresponding ground truth. B1 is short for BLEU-1.

Table 3 :
Automatic evaluation of eligibility criteria generation on the test set at the criteria level, i.e., comparing the generated inclusion/exclusion criteria with the ground truth one by one. B1 is short for BLEU-1.

Table 4 :
Clinical accuracy evaluation results of eligibility criteria generation on the test set. P, R, F1, and Jac are short for precision, recall, F1 score, and micro-Jaccard score, respectively.

Table 5 :
Qualitative generation results of AutoTrial for criteria generation under instructions. Manual Input: the contextual textual inputs offered by users, where the trial setup x_s is shared for the same trial and the instructions x_r are specific to each criterion; Automatically Built Input: the reference criteria automatically retrieved and built as the input x_e for AutoTrial; Output: results generated by AutoTrial; Groundtruth: the corresponding criteria written by human clinicians in the original trial documents. The Manual Input and Automatically Built Input are concatenated as the final input. Yellow highlights the instruction tokens; green highlights the setup tokens; blue highlights the reference tokens; red highlights the target tokens. Reference texts are truncated in the middle due to limited space.