Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision

Clinical trials are essential for drug development but are extremely expensive and time-consuming to conduct. It is beneficial to study similar historical trials when designing a new clinical trial. However, lengthy trial documents and the lack of labeled data make trial similarity search difficult. We propose a zero-shot clinical trial retrieval method, Trial2Vec, which learns through self-supervision without annotating similar clinical trials. Specifically, the meta-structure of trial documents (e.g., title, eligibility criteria, target disease) along with clinical knowledge (e.g., the UMLS knowledge base https://www.nlm.nih.gov/research/umls/index.html) are leveraged to automatically generate contrastive samples. In addition, Trial2Vec encodes trial documents with respect to this meta-structure, producing compact embeddings that aggregate multi-aspect information from the whole document. We show through visualization that our method yields medically interpretable embeddings, and that it achieves a 15% average improvement over the best baselines on precision/recall for trial retrieval, evaluated on 1,600 trial pairs that we labeled. In addition, we show that the pretrained embeddings benefit the downstream trial outcome prediction task over 240k trials. Software is available at https://github.com/RyanWangZf/Trial2Vec.


Introduction
Clinical trials are essential for developing new medical interventions (Friedman et al., 2015). Many considerations come into the design of a clinical trial, including study population, target disease, outcome, drug candidates, trial sites, and eligibility criteria, as in Table 1. It is often beneficial to learn from related past clinical trials when designing an optimal trial protocol (Wang et al., 2022b). However, accurate similarity search over lengthy trial documents remains an unmet need. Self-supervision-based pretraining has delivered promising performance for many NLP and CV tasks with fine-tuning (Devlin et al., 2019; Liu et al., 2019; He et al., 2021; Bao et al., 2021; Wang et al., 2022c). Nevertheless, we find there has been little work on zero-shot document retrieval, as most methods address document retrieval in a supervised fashion (Humeau et al., 2019; Khattab and Zaharia, 2020; Guu et al., 2020; Karpukhin et al., 2020; Lin et al., 2020; Luan et al., 2021; Wang et al., 2021; Hofstätter et al., 2020; Li et al., 2020; Zhan et al., 2021; Hofstätter et al., 2021a,b; Jiang et al., 2022) or improve document pretraining for further supervision (Beltagy et al., 2020; Zaheer et al., 2020; Ainslie et al., 2020; Zhang et al., 2021).
Recently, a burgeoning body of research (Gao et al., 2021; Wu et al., 2021; Wang et al., 2022a) proposes self-supervised learning to train semantically meaningful sentence embeddings free of labels. However, there are still challenges in applying these methods to document similarity search:
• Lengthy documents. These zero-shot BERT retrieval methods all work on similarity search over short sentences (usually below 10 words), while trial documents are often above 1k words. Simply encoding lengthy trials by truncating and averaging the embeddings of all remaining tokens inevitably leads to poor retrieval quality.
• Inefficient contrastive supervision. These unsupervised methods use simple instance-discriminative contrastive learning (CL) within a batch; e.g., SimCSE (Gao et al., 2021) feeds one sentence into the encoder twice to get a positive pair and treats all other sentences as negatives. This paradigm has low supervision efficiency, requiring a large batch size, large data, and long training time, which is infeasible for learning from long trial documents.
In this work, we propose Clinical Trial To Vectors (Trial2Vec), a zero-shot trial document similarity search method based on self-supervision. We design a trial encoding framework that considers the meta-structure to avoid the risk of semantic meaning vanishing due to a uniform average of token embeddings. Meanwhile, the meta-structure is utilized to generate contrastive samples for efficient supervision. Medical knowledge is introduced to further enhance the negative sampling for CL. Our main contributions are:
• We are the first to study the trial-to-trial retrieval task, proposing a label-free SSL model that encodes long trials into semantically meaningful embeddings without labels.
• We propose a data-efficient CL method built on medical knowledge and the trial meta-structure, which can potentially be extended to zero-shot retrieval of other structured documents.
• We demonstrate the superiority of Trial2Vec on a trial relevance dataset of 1,600 trial pairs annotated by domain experts. We also show that Trial2Vec assists better downstream trial outcome prediction on a dataset of 240k trials.
Related works

Text & document retrieval
General texts. Early information retrieval methods depend on manual feature engineering (Robertson and Zaragoza, 2009; Yang et al., 2017). By contrast, dense retrieval methods based on distributional word representations, e.g., Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and Doc2Vec (Le and Mikolov, 2014), became popular owing to their superior performance. The advent of deep models, especially contextualized encoders like BERT (Devlin et al., 2019), encouraged an explosion of neural retrieval methods (Van Gysel et al., 2016; Zamani et al., 2018; Guo et al., 2016; Dehghani et al., 2017; Onal et al., 2018; Reimers and Gurevych, 2019; Chang et al., 2019; Nogueira and Cho, 2019; Chen et al., 2021; Lin et al., 2020; Xiong et al., 2020; Karpukhin et al., 2020; Yates et al., 2021). However, most of them rely on supervised training on sentence pairs from general texts, e.g., SNLI (Bowman et al., 2015). When labels are expensive to acquire, as in the clinical trial case, we need zero-shot learning models. Although some works perform post-processing on pretrained BERT embeddings to improve their retrieval quality (Li et al., 2020; Su et al., 2021), their performance is far from optimal without task-specific training.
Clinical trials. Traditional clinical trial query search systems (Tasneem et al., 2012; Tsatsaronis et al., 2012; Jiang and Weng, 2014; Park et al., 2020) are built on protocol databases. In contrast to dense retrieval, these methods rely on rule-based entity matching and are thus not flexible enough. Recent works (Roy et al., 2019; Rybinski et al., 2020, 2021) propose supervised neural ranking for clinical trial query search. However, all of them work by matching trial titles or relevant segments against an input user query. While Trial2Vec can also assist query search, it is the first to encode complete trial documents for trial-level similarity search.

Text contrastive learning
Contrastive learning has recently been a hot topic in NLP and CV (Chen et al., 2020a,b; Chen and He, 2021; Carlsson et al., 2020; Zhang et al., 2020; Wu et al., 2020; Yan et al., 2021; Gao et al., 2021; Wang et al., 2020b; Wang and Sun, 2022). CL is one main topic under the SSL umbrella. It sheds light on reaching performance comparable to supervised learning, free of manual annotations. While CL has been applied to enhance downstream NLP applications like text classification (Li et al., 2021; Zhang et al., 2022), only a few works (Wang et al., 2020a; Zhang et al., 2020; Yan et al., 2021; Yang et al., 2021) are able to do zero-shot retrieval. Nonetheless, all of them focus on enhancing sentence embeddings by manipulating text only and are therefore suboptimal when facing lengthy documents. By contrast, Trial2Vec uses the document meta-structure together with domain knowledge to obtain and improve document embeddings.

Method
In this section, we present the details of Trial2Vec.
The main idea is to jointly learn global and local representations from trial documents considering their meta-structure. Specifically, as observed in Table 1, a trial document consists of multiple sections, while the key attributes (e.g., title, disease, intervention, etc.) occupy a small portion of the whole document. This motivates us to design a hierarchical encoding and the corresponding contrastive learning framework. The overview is illustrated in Fig. 1. Our method generates local attribute embeddings using the TrialBERT backbone separately, then aggregates the local embeddings with a learnable attention module to obtain a global trial embedding that emphasizes significant attributes. We present the pretraining of the backbone encoder in §3.1; we then describe the hierarchical encoding process based on the backbone encoder in §3.2; the hierarchical contrastive learning methods considering the meta-structure and medical knowledge are elucidated in §3.3; at last, we describe the applications of the proposed framework in §3.4.

Backbone encoder: TrialBERT
We leverage the BERT architecture as the backbone encoder in the framework. In detail, we use the WordPiece tokenizer together with the BioBERT (Lee et al., 2020) pretrained weights as the starting point. We continue the pretraining with the Masked Language Modeling (MLM) loss on three trial-related data sources: ClinicalTrials.gov, Medical Encyclopedia, and Wikipedia articles (see Table 6), to get TrialBERT. ClinicalTrials.gov is a database that contains around 400k clinical trials conducted in 220 countries. Medical Encyclopedia has 4k high-quality articles introducing terminologies in medicine. We also retrieve relevant Wikipedia articles corresponding to the 4k terminologies of Medical Encyclopedia.
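For illustration, the MLM masking used in such continued pretraining can be sketched in plain Python. This follows the standard BERT recipe (select 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and keep 10% unchanged); the function name and interface below are ours, not from the Trial2Vec codebase.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=None):
    """BERT-style MLM masking.

    Returns (input_ids, labels), where labels[i] is the original token
    to predict at a masked position and -100 elsewhere (ignored by the loss).
    """
    rng = random.Random(seed)
    input_ids, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels.append(tok)  # predict the original token here
            r = rng.random()
            if r < 0.8:
                input_ids[i] = mask_id                     # 80%: [MASK]
            elif r < 0.9:
                input_ids[i] = rng.randrange(vocab_size)   # 10%: random token
            # remaining 10%: keep the original token unchanged
        else:
            labels.append(-100)  # position not used in the loss
    return input_ids, labels
```

The MLM loss is then a cross-entropy over the non-ignored label positions.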

Global and local embeddings by Trial2Vec
TrialBERT embeddings pretrained with MLM on clinical corpora still carry weak semantic meaning. Meanwhile, previous sentence-embedding BERTs all take an average pooling over token embeddings, which causes semantic meaning to vanish when applied to lengthy clinical trials. Therefore, we propose the Trial2Vec architecture, which exploits global and local embeddings for a trial based on its meta-structure.
We split the attributes of a trial into two distinct sets: key attributes and contexts. The first set includes the trial title, intervention, condition, and main measurement, which are sufficient to retrieve a pool of coarsely relevant trial candidates; the second includes descriptions, eligibility criteria, references, etc., which differentiate trials targeting similar diseases or interventions because they provide multi-faceted details regarding disease phases, study designs, targeted populations, etc. According to this design, local embeddings {v_att^l}_{l=1}^{L} ∈ R^{L×D} are produced separately for each key attribute. On the other hand, a context embedding v_ctx ∈ R^D is obtained by encoding the context texts. Note that all of the above encoding is conducted by the same encoder.
We further refine the local embeddings with the context embedding and aggregate them to yield the global trial embedding v_g ∈ R^D. The refinement is performed by multi-head attention (Eq. (1)), which relocates the attention over the key attributes to enhance the discriminative power of the yielded global embedding.
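A simplified, single-head numpy sketch of this aggregation step is given below. The actual model uses learnable multi-head attention; treating the context embedding as the query over attribute keys is our illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(v_att, v_ctx):
    """Context-conditioned attention pooling.

    v_att: (L, D) local key-attribute embeddings.
    v_ctx: (D,) context embedding.
    Returns the global trial embedding v_g of shape (D,).
    """
    scores = v_att @ v_ctx / np.sqrt(v_att.shape[1])  # (L,) attention logits
    alpha = softmax(scores)                           # weights over attributes
    return alpha @ v_att                              # convex combination
```

Attributes aligned with the context receive larger weights, so the global embedding emphasizes the most significant attributes rather than averaging them uniformly.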

Hierarchical contrastive learning
For data-efficient contrastive learning, we utilize the meta-structure and medical knowledge to contrast the local and global embeddings hierarchically.
Global contrastive loss. The first objective is to maximize the semantics captured in trial embeddings for similarity search. Instead of applying an in-batch instance-wise contrastive loss like SimCSE, we propose to sample informative negative pairs by exploiting the trial meta-structure. As shown in Fig. 1, some trials may be linked by a common attribute like disease or intervention. Denote a trial consisting of several attributes by x = {x_title, x_dise, x_intv, ...}; we can build an informative negative sample x^- by replacing its title with that of another trial which also targets the disease x_dise. Meanwhile, we apply a random attribute dropout to x to formulate a positive sample x^+. An InfoNCE loss is utilized in a batch of B trials, where the negative sample set V_i^- = {v_gi^-} ∪ {v_gj}_{j≠i} and ψ(·,·) measures the cosine similarity between two vectors. The global contrastive loss encourages the model to capture the attributes of interest by discriminating the subtle differences between input trial attributes, which prevents the semantic meaning from vanishing due to the average pooling over all trial texts.
Local contrastive loss. In addition to the global trial embeddings, we put supervision on the local embeddings to inject medical knowledge into the model. Unlike general texts, two medical texts can overlap dramatically word-wise but still describe two distinct things, which is challenging for similarity computing. To strengthen TrialBERT's discriminative power for medical texts, we extract the key medical entities {e_1, e_2, ...} in each text; then a positive sample is built by mapping one entity e_1 to its canonical name or to a similar entity ê_1 under the same parental concept defined by UMLS. Similarly, a negative sample is built by deleting an entity or replacing it with a dissimilar one. An InfoNCE loss is applied analogously on these local pairs. At last, we jointly optimize the global and local contrastive losses.
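For concreteness, the InfoNCE objective for a single anchor embedding can be sketched in numpy as follows; the temperature value here is illustrative, not the paper's hyperparameter.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity psi(a, b) between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(anchor, positive, negatives, tau=0.05):
    """InfoNCE loss for one anchor.

    Pulls the positive toward the anchor and pushes the negatives away:
    -log( exp(psi(a,p)/tau) / (exp(psi(a,p)/tau) + sum_n exp(psi(a,n)/tau)) )
    """
    pos = np.exp(cosine(anchor, positive) / tau)
    neg = sum(np.exp(cosine(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))
```

Averaging this quantity over the B anchors in a batch, with the hard negatives constructed above included in each V_i^-, gives the global contrastive loss.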

Application of global & local embeddings
The hierarchical contrastive learning offers Trial2Vec great flexibility for various downstream tasks in zero-shot learning. First, the global trial embeddings v_g can be directly used for similarity search by comparing pairwise cosine similarities between trials. The computed trial embeddings can also help identify and discover research topics when we apply visualization techniques. On the other hand, we can also execute query search using partial attributes, owing to the contrastive learning between local and global embeddings. For trial-level predictive tasks, e.g., trial termination prediction, a classifier can be attached to the pretrained global trial embeddings and learned; the backbone TrialBERT is also capable of short medical sentence retrieval thanks to the local contrastive learning.
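Zero-shot similarity search over precomputed global embeddings then reduces to cosine ranking; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def top_k_similar(query_emb, trial_embs, k=10):
    """Rank candidate trial embeddings by cosine similarity to a query.

    query_emb: (D,) global embedding of the query trial.
    trial_embs: (N, D) matrix of candidate trial embeddings.
    Returns (indices, similarities) of the top-k candidates.
    """
    q = query_emb / np.linalg.norm(query_emb)
    M = trial_embs / np.linalg.norm(trial_embs, axis=1, keepdims=True)
    sims = M @ q                     # cosine similarity to every candidate
    idx = np.argsort(-sims)[:k]      # highest similarity first
    return idx, sims[idx]
```

Because embeddings are computed once per trial, retrieval over a large corpus is a single matrix-vector product plus a sort.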

Experiments
In this section, we conduct five types of experiments to answer the following research questions: • Exp 1 & 2. How does Trial2Vec perform in complete and partial retrieval scenarios?
• Exp 3. How do the proposed SSL tasks / embedding dimension contribute to the performance?
• Exp 4. Is the trial embedding space interpretable and aligned with medical ontology?
• Exp 5. Do the pretrained embeddings benefit downstream trial termination prediction?
• Exp 6. Qualitatively, how do the retrieval results of Trial2Vec differ from those of the baselines?

Dataset & Setup
Trial Similarity Search. We created a labeled trial dataset to evaluate retrieval performance, where paired trials are labeled as relevant or not. We keep 311,485 interventional trials from the total of 399,046 trials. We uniformly sample 160 trials as the query trials. To overcome the sparsity of relevance, we take advantage of TF-IDF (Salton et al., 1983) to retrieve the ranked top-10 trials as candidates to be labeled, resulting in 1,600 labeled pairs of clinical trials. Unlike general documents, clinical trial documents contain many medical terms and formulations. We recruited clinical informatics researchers, each assigned 400 pairs to label as relevant or not using labels {1, 0}. To keep the labeling processes consistent, we specify a minimum annotation guide for judging relevance: (1) same disease; or (2) same intervention and similar diseases (e.g., cancer on distinct body parts). We use precision@k (prec@k), recall@k (rec@k), and nDCG@5 to evaluate and report performances.
prec@k = (# of relevant trials in the top k results) / k,  (10)

rec@k = (# of relevant trials in the top k results) / (# of relevant trials in all candidate trials).  (11)

Trial termination prediction. We take the pretrained Trial2Vec embeddings for predicting trial outcomes, i.e., whether the trial will be terminated or not. We add one additional fully-connected layer on top of Trial2Vec. The targeted outcomes are in the status section of clinical trials, described in Table 2. We formulate outcome prediction as a binary classification problem predicting the completion or termination of trials, where we get 210,411 positively and 34,305 negatively labeled trials, respectively. We take 70% of all trials as the training set and 20% as the test set; the remaining 10% is used as the validation set for tuning and early stopping. We utilize three metrics for evaluation: accuracy (ACC), area under the Receiver Operating Characteristic curve (ROC-AUC), and area under the Precision-Recall curve (PR-AUC).
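The retrieval metrics of Eqs. (10) and (11) can be computed directly from the ranked 0/1 relevance labels of the candidates; a minimal sketch:

```python
def precision_at_k(ranked_relevance, k):
    """Eq. (10): fraction of the top-k retrieved trials that are relevant.

    ranked_relevance: list of 0/1 relevance labels in ranked order.
    """
    return sum(ranked_relevance[:k]) / k

def recall_at_k(ranked_relevance, k):
    """Eq. (11): fraction of all relevant candidate trials found in the top k."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:k]) / total_relevant
```

Note that with 10 candidates per query and often fewer than 5 positives, prec@k is bounded below 1 for large k, which explains the precision drop all methods show as k grows.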
We keep all methods' embedding dimensions at 768. We start from a BERT-base model and continue pretraining on clinical-domain corpora, yielding our TrialBERT, which serves as the backbone for BERT-Whitening and BERT-SimCSE for a fair comparison. We pretrain for 5 epochs with batch size 100 and learning rate 5e-5. In the second SSL training phase, the AdamW optimizer with a learning rate of 2e-5, batch size of 50, and weight decay of 1e-4 is used. Experiments were done with 6 RTX 2080 Ti GPUs.

Exp 1. Complete Trial Similarity Search
Since labels are unavailable in the training phase, we only chose unsupervised/self-supervised baselines. Results are shown in Table 3. Trial2Vec outperforms all baselines by a large margin, with around 15% improvement over the best baselines on each metric on average. Among the baselines, all except TF-IDF have similar performance. When k is small, the precision gap between Trial2Vec and the baselines is large; when k is large, all methods encounter a precision reduction. That is because the pool of candidate trials contains 10 trials but the number of positive pairs per query is often less than 5, which limits the maximum of the numerator of prec@k in Eq. (10). Likewise, Trial2Vec also shows stronger performance in rec@k, which is discounted by the maximum number of positive pairs.
Interestingly, the state-of-the-art sentence BERTs, e.g., BERT-Whitening and BERT-SimCSE, show limited improvement over the original BERT and even Word2Vec. Unlike general documents, clinical trials may overlap in much of their content but still be irrelevant if the key entities are different. This special characteristic invalidates the assumption, used in general document retrieval, that a document with a similar passage is relevant (Craswell et al., 2020). Without well-designed SSL, it is hard for these methods to learn these subtle differences. Moreover, clinical trial documents are often much longer than the general documents in those open datasets.

Table 4: Trial outcome prediction performances of baselines and Trial2Vec, after fine-tuning.

Exp 2. Partial Query Trial Retrieval
We further investigate the partial trial retrieval scenario, where users intend to find similar trials with short and incomplete descriptions, e.g., partial attributes. Results are illustrated in Fig. 2. We start by measuring how well Trial2Vec retrieves trials using only the title. We observe that using the title alone is sufficient to yield performance comparable to the best baseline for complete retrieval shown in Table 3. Nonetheless, we identify that concatenating keywords or the intervention with the title reduces performance. Combining the title and disease yields similar performance to involving all attributes. This phenomenon signifies that the disease plays a vital role in trial similarity and should always be involved in query trial retrieval.

Exp 3. Ablation Studies
We conducted ablation studies to measure how the SSL tasks and embedding dimensions contribute to the final results. Results are shown in Fig. 3, where we remove one task for each setting and re-evaluate. Here, att mc and ctx mc correspond to the global contrastive loss with negative sampling on key attributes and contexts, respectively; semantic mc indicates the local contrastive loss. We observe that ctx mc is very important: without it, only the attributes of trials are included in the training and inference of Trial2Vec, resulting in a significant performance drop. However, even when using only a small segment of trials (the attributes), Trial2Vec still reaches performance similar to BERT-SimCSE, which receives the whole trial document as input. This demonstrates the importance of picking high-quality negative samples during the CL process. Similarly, we observe that the other two tasks also improve the retrieval quality.
Fig. 4 illustrates the retrieval performance across different embedding dimensions. We find that reducing the embedding dimension does not affect the performance of Trial2Vec much, i.e., one can choose a small embedding dimension (e.g., 128) without suffering much performance degradation while saving substantial storage and computational resources.

Exp 4. Embedding Space Visualization
Fig. 5 plots the 2D visualization of the embedding space of Trial2Vec using t-SNE (Van der Maaten and Hinton, 2008), where around 2k trials are uniformly sampled from 300k trials. The tag texts illustrate the target diseases of the trials in different colors. We observe that these trial embeddings form interpretable clusters corresponding to target disease categories. More discussion of this visualization can be found in Appendix B.

Exp 5. Trial Termination Prediction
Results are illustrated in Table 4. Compared with the shallow models, BERT-based methods gain better performance, which is credited to the deep transformer architecture with its stronger learning capability. Trial2Vec encodes trial documents hierarchically on their meta-structure, thus better revealing the trial characteristics, which play a central role in predicting potential outcomes.

Exp 6. Case Study
We perform a qualitative analysis of the similarity search results of Trial2Vec and two baselines. Results are shown in Table 5. These two case studies show that the TF-IDF and BERT models all tend to attend to frequent words in the query trials, e.g., blood and iron in case study 1, and heart failure in case study 2. This bias comes from the average pooling over all token embeddings. The top-1 trial retrieved by Trial2Vec, on the other hand, is more similar, thanks to the hierarchical encoding and the specific local and global contrastive learning. We add more explanations regarding these cases in Appendix C.

Conclusion
This paper investigated utilizing BERT with self-supervision for encoding trials into dense embeddings for similarity search. Experiments show our method succeeds in zero-shot trial search under various settings. The embeddings are also useful for downstream trial predictive tasks. The qualitative analysis, including embedding space visualization and case studies, further verifies that Trial2Vec obtains a medically meaningful understanding of clinical trials.

A Baselines for clinical trial similarity search

• TF-IDF (Salton et al., 1983; Salton and Buckley, 1988). Short for term frequency-inverse document frequency, it has been widely used in information retrieval systems for decades. One can use TF-IDF for document retrieval by concatenating the scores of all words in a document and then computing the cosine distance between document vectors.
We run it based on the rank-bm25 package with its default hyperparameters.
• Word2Vec (Mikolov et al., 2013). It is a classic dense retrieval method that builds distributed word representations with self-supervised learning (CBOW). We take an average pooling of the word representations in a document and retrieve by cosine distance. We use gensim to run this method.
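The average-pooling retrieval used for this baseline can be sketched as follows (gensim-free, with a toy vector table; the identifiers are ours):

```python
import numpy as np

def doc_embedding(tokens, word_vecs):
    """Average pooling of word vectors; out-of-vocabulary words are skipped.

    tokens: list of word strings in the document.
    word_vecs: dict mapping word -> numpy vector.
    Returns the mean vector, or None if no token is in the vocabulary.
    """
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0) if vecs else None

def cosine_sim(a, b):
    """Cosine similarity used for ranking documents."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

This uniform pooling is exactly the design that Trial2Vec's attention-based aggregation is meant to improve on for long documents.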
• BERT. We take an average pooling over all token embeddings at its last layer for similarity computation. We use the TrialBERT pretrained on all the clinical trial documents.
• BERT-Whitening (Huang et al., 2021; Su et al., 2021). This is an unsupervised post-processing method that whitens the anisotropic BERT embeddings (Ethayarajh, 2019; Li et al., 2020) to improve semantic search. We take the average of the last and first layers of its BERT embeddings, following Su et al. (2021).
• BERT-SimCSE (Gao et al., 2021). It is a contrastive sentence representation learning method based on the InfoNCE loss. It simply takes the other samples in a batch as negative samples.
In the second example, the query trial investigates the benefits of Diclofenac for normotensive patients with acute symptomatic pulmonary embolism and right ventricular dysfunction. TF-IDF finds an irrelevant study on the efficacy and safety of Elobixibat for adults with NAFLD or NASH. TrialBERT also retrieves an irrelevant study on intravascular volume expansion and neuroendocrine-renal function profiles in chronic heart failure. On the other hand, Trial2Vec finds a trial that studies the same type of drug with a similar purpose as the query's: evaluating the efficacy of an NSAID (Diclofenac) on the evolution of postoperative (cardiac surgery) pericardial effusion.

Figure 1 :
Figure 1: Overview of the proposed Trial2Vec framework. Top left: the training strategy that accounts for unlabeled input trial documents with meta-structure along with an external medical knowledge database, e.g., UMLS. Top right: the contrastive supervision splits into meta-structure-guided and knowledge-guided parts, respectively. Bottom left: our method hierarchically encodes trials into local and global embeddings on the trial meta-structure. Bottom right: the encoded trial-level embeddings can be used for trial search, query trial search, and downstream tasks.

Figure 2 :
Figure 2: Performance of Trial2Vec in the partial retrieval scenarios. We use different parts of the trial as queries to retrieve similar trials, including keyword kw, intervention intv, disease dz, and context ctx. Error bars indicate the 95% confidence interval of the results.

Figure 3: Ablation study on the contribution of each task to the final result. att, mc, ctx are short for attribute, matching, and context, respectively; all indicates the full Trial2Vec where all tasks are used.

Figure 4: Retrieval performance of Trial2Vec across different embedding dimensions.

Figure 5: 2D t-SNE visualization of the Trial2Vec embedding space, colored by target disease.

Table 1 :
An example of the meta-structure of a clinical trial document drawn from ClinicalTrials.gov.

Table 2 :
Statistics of trial statuses in the ClinicalTrials.gov database, where we categorize Approved & Completed as completion, and Suspended, Terminated, and Withdrawn as termination for trial outcome prediction.

Table 3 :
Precision/Recall and nDCG of the retrieval models on the labeled test set. Values in parentheses show the 95% confidence interval. Best values are in bold.

Table 5 :
Case studies comparing the retrieval performance of Trial2Vec with the baseline models. Due to space limits, only the title and NCT ID of each trial are given.

Table 6 :
List of text corpora used for continual pretraining of TrialBERT.