Adapting Event Extractors to Medical Data: Bridging the Covariate Shift

We tackle the task of adapting event extractors to new domains without labeled data, by aligning the marginal distributions of source and target domains. As a testbed, we create two new event extraction datasets using English texts from two medical domains: (i) clinical notes, and (ii) doctor-patient conversations. We test the efficacy of three marginal alignment techniques: (i) adversarial domain adaptation (ADA), (ii) domain adaptive fine-tuning (DAFT), and (iii) a new instance weighting technique based on language model likelihood scores (LIW). LIW and DAFT improve over a no-transfer BERT baseline on both domains, but ADA only improves on notes. Deeper analysis of performance under different types of shifts (e.g., lexical shift, semantic shift) explains some of the variations among models. Our best-performing models reach F1 scores of 70.0 and 72.9 on notes and conversations respectively, using no labeled target data.


Introduction
Events are an important phenomenon in the field of computational semantics. They offer an intuitive mechanism for constructing structured representations of text, which can be used for downstream tasks such as question answering and summarization. Events also embody a crucial function of language: the ability to report happenings. Narratives from many diverse domains (e.g., news articles, literary texts, clinical notes) use events as basic building blocks. These characteristics make event extraction a key sub-task of interest for text understanding pipelines in multiple domains. Despite its importance, building high-performing and generalizable systems for event extraction has remained an elusive goal. One of the major hurdles is that the notion of what counts as an important event is usually task-specific or domain-specific (sometimes both). For example, to build a system that can track a patient's disease progression from clinical notes, event extractors only need to focus on extracting medical events relevant to that illness. This task/domain specificity has encouraged prior work to focus on specific event types (Grishman and Sundheim, 1996; Doddington et al., 2004; Kim et al., 2008) or domains (Pustejovsky et al., 2003b; Sims et al., 2019). Owing to this narrow focus, supervised event extractors often fail to adapt to new domains or event types (Keith et al., 2017). Unsupervised event extractors that use syntactic rule-based modules (Saurí et al., 2005; Chambers et al., 2014), conversely, have a tendency to over-generate by labeling most verbs and nouns as events.
In this work, we try to achieve a balance between these extremes by adapting event extractors using unsupervised domain adaptation techniques. We also study the behavior of these techniques under various types of linguistic shifts (e.g., lexical shift, semantic shift) to gain insight into differences among them. Exploring adaptability under no (or little) supervision is crucial, since sourcing annotated data for new domains, especially medical texts, can be expensive and time-consuming. Following prior work, we formulate event extraction as the task of labeling triggers, i.e., words which instantiate an event (Linguistic Data Consortium, 2005). For example, in the sentence "She was diagnosed with cancer," diagnosed and cancer are triggers, referring to "diagnosis" and "illness" events respectively. Throughout our work, we model event trigger labeling as token-level classification.
To test adaptability, we create new event extraction test sets using English texts from two diverse medical domains: (i) clinical notes, and (ii) doctor-patient conversations. We develop comprehensive event annotation guidelines, based on TimeML (Pustejovsky et al., 2003a) and Thyme-TimeML (Styler IV et al., 2014) (§3), and use them to annotate 45 documents from each domain. As a baseline, we train a BERT-based event extraction model on English news articles from TimeBank (Pustejovsky et al., 2003b), which is labeled using TimeML, and test its performance on our datasets. To improve this out-of-domain baseline performance, we tackle the problem of covariate shift, i.e., differences between marginal distributions of source (news) and target domains (notes or conversations). We experiment with three marginal alignment techniques: (i) adversarial domain adaptation (ADA) (Ganin and Lempitsky, 2015), (ii) domain-adaptive fine-tuning (DAFT) (Han and Eisenstein, 2019), and (iii) a new instance weighting scheme using language model likelihood scores (LIW).
Our results show that DAFT and LIW improve over BERT on both domains, whereas ADA only improves on notes. Across domains, ADA and DAFT perform best on notes and conversations respectively. To probe why some techniques are better at addressing certain source-target domain pairs, we analyze model performance on various types of covariate shifts (e.g., lexical shift, semantic shift). Our analysis uncovers interesting patterns such as varying ability of models to leverage subword morphology to generalize to technical terms, and LIW's performance improvement on long-term state events (e.g., chronic illnesses). Our best models achieve F1 scores of 70.0 and 72.9 on notes and conversations respectively with no labeled target data.

Related Work
Event extraction granularity divides extraction paradigms into two types: (i) document-level paradigms that assume that a piece of text refers to a single event (Grishman and Sundheim, 1996), and (ii) sentence-level paradigms that assume that a single sentence describes one or more events. Event representation also divides extraction paradigms into two types: (i) span-based paradigms that represent events by marking text spans that refer to events, called triggers or nuggets (Linguistic Data Consortium, 2005; Mitamura et al., 2015; O'Gorman et al., 2016), and (ii) structured paradigms that represent events by marking text spans and adding additional arguments (e.g., participants, location, etc.) to create a structured template (Grishman and Sundheim, 1996). Event categorization divides extraction paradigms into: (i) ontology-driven paradigms that are limited to specific event types (Grishman and Sundheim, 1996; Doddington et al., 2004), and (ii) ontology-free paradigms that do not place type restrictions (Pustejovsky et al., 2003b; Araki and Mitamura, 2018). We use a sentence-level, span-based, ontology-free event extraction paradigm. Sentence-level extraction suits our domains of interest since notes and conversations tend to discuss multiple events. Span-based and ontology-free extraction allows us to develop adaptable coding guidelines since event arguments and types are usually domain-specific or task-specific. This adaptability sets our work apart from other prior work on medical event extraction such as adverse drug event extraction (Nikfarjam et al., 2015; Cocos et al., 2017; Henry et al., 2020) and personal event extraction from online support groups (Wen et al., 2013; Naik et al., 2017), which focus on specific event types.
Our guidelines draw heavily from the Thyme-TimeML guidelines (Styler IV et al., 2014) used by the Clinical TempEval challenges on event ordering in clinical notes (Bethard et al., 2015, 2016, 2017), but also cover event extraction in a novel domain: doctor-patient conversations.

Unsupervised Domain Adaptation
Unsupervised domain adaptation is the task of transferring a model from a source domain to a target domain, using only unlabeled data from the target domain, by aligning source and target distributions. Early approaches such as structural correspondence learning (SCL) (Blitzer et al., 2006, 2007) tried to solve this by mapping source and target examples into a shared pivot feature space, where pivot features are selected to be features that behave the same way for discriminative learning in both domains (e.g., sentiment terms such as amazing and great show similar behavior for sentiment analysis across domains). With advances in neural representation learning, autoencoder-based methods (Glorot et al., 2011), neural SCL (Ziser and Reichart, 2017), adversarial domain adaptation (Ganin and Lempitsky, 2015; Ganin et al., 2016) and LM fine-tuning methods (Han and Eisenstein, 2019; Gururangan et al., 2020) have shown success in learning a shared space in which source and target domains are aligned. We propose a new method (LIW) which relies on instance weighting via language model likelihood, and contrast it with adversarial domain adaptation (ADA) and domain adaptive fine-tuning (DAFT). These two techniques have shown promise on sequence labeling tasks (Gui et al., 2017; Han and Eisenstein, 2019; Naik and Rosé, 2020), and offer an interesting contrast between approaches that jointly perform alignment and task training (ADA) and approaches that perform these steps sequentially (DAFT). Comparing all three techniques also provides us the opportunity to study which methods adapt better to different kinds of shifts between source and target domains (e.g., shifts in vocabulary, syntax, etc.).

Dataset Creation
To test adaptability of event extraction models, we create a testbed using data from two domains:
1. Clinical Notes: Clinical notes are records documenting physician observations from their interactions with patients. They usually detail various aspects of a patient's care such as present illness, symptoms, medical history, treatments, and test results. They share a thematic structure, though particular specialties (e.g., cardiology) and institutions often incorporate their own modifications. We collected a set of 4999 de-identified clinical notes from 40 specialties, by scraping mtsamples (https://www.mtsamples.com/). The notes are reference samples provided by various users, with names and dates edited for confidentiality. They are freely available to print, share, link and distribute, as per website policy. Average length of a clinical note is 652 tokens.
2. Doctor-Patient Conversations: This data contains human-transcribed, de-identified conversations recorded during physician-patient visits. The conversations often follow a similar schema, with patients describing their symptoms, doctors inquiring about ongoing treatments, and then suggesting potential follow-up treatments/tests. We use a proprietary database of 63,540 conversations covering 53 specialties, collected by Abridge AI Inc. Physicians across a variety of specialties are contracted to record natural in-office conversations with their patients who agree to participate in the research by providing verbal and written consent. Recordings are made on a digital recording device or a smartphone application and are uploaded to a secure server where they are scrubbed of all identifiable information, in accordance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA) privacy rule. De-identified recordings are transcribed and stored in a database, which currently contains over 100,000 recordings dating from 2006 to 2017. Average conversation transcript length is 2309 tokens.
These domains exhibit different types of linguistic shifts from the source (news). While both domains exhibit a shift in vocabulary, it is more pronounced in clinical notes since they are written by doctors (experts) who use highly technical terms. Conversely, shifts in syntax are more pronounced in conversations due to the prevalence of repetition, back-channeling, interruptions etc. Semantic shifts are more pronounced in conversations since they contain a higher proportion of hypothetical statements (e.g., when doctors ask questions, make requests or "think out loud") than both notes and news articles which tend to serve as records of actual events. To better evaluate model performance on linguistic shifts, we control for topical variation across domains by limiting our focus to 3 specialties: Cardiovascular/Pulmonary (Cardio), Obstetrics/Gynaecology (Obgyn) and Hematology/Oncology (Onco). These specialties are well-represented in both notes and conversations, and cover events with a variety of temporalities ranging from intervals with fixed duration (e.g., pregnancy), to intervals with indeterminable endpoints (e.g., long-term cardiac failure). Table 1 gives an overview of the number of notes and conversations in each specialty.

Developing Event Annotation Guidelines
We develop a set of coding guidelines for the task of annotating event triggers in documents from these two domains. Our coding guidelines build upon TimeML (Pustejovsky et al., 2003a), a rich specification language for annotation of events and temporal expressions in text (the complete TimeML coding manual is available at https://catalog.ldc.upenn.edu/docs/LDC2006T08/timeml_annguide_1.2.1.pdf), and Thyme-TimeML (Styler IV et al., 2014), a variant of TimeML developed for clinical notes. We start with these guidelines because they use a syntax-driven domain-agnostic definition of events, allowing for an adaptable annotation scheme. In TimeML, the term event refers to situations that happen or occur, or circumstances in which something obtains or holds true. This is a broad definition, consistent with Bach's definition of eventualities (Bach, 1986), and the idea of fluents (McCarthy, 2002). Events can be expressed in text by means of tensed or untensed verbs, nominalizations, adjectives, predicative clauses or prepositional phrases. TimeML describes rules to annotate events in all these syntactic categories. Styler IV et al. (2014) adapted these rules for clinical notes. They focused on the THYME corpus of 1254 de-identified notes from the Mayo Clinic, representing two fields in oncology: brain cancer and colon cancer. As a first step, we annotate one document from each of our domains following TimeML and Thyme-TimeML rules. During this phase, we identify cases where it is reasonable to deviate from these guidelines.
Deviations from TimeML: Our guidelines differ from TimeML in their treatment of two categories. Our complete coding manual, including example annotations, is available at https://github.com/aakanksha19/MedicalEventExtraction.
1. Activity patterns: Activity patterns are events that are neither pure generics, nor single events clearly positioned in time. Pure generics are events which discuss illnesses/treatments in general, and are not associated with a specific person and time (e.g., "there is a benefit to systemic adjuvant chemotherapy"). For example, consider the sentence "I take my blood pressure regularly." The event take is not grounded in time. It is also not a pure generic event as it is definitely associated with the speaker. Such events are not annotated in TimeML. However, in our data, these activity patterns occur frequently in crucial contexts such as taking medications, following lifestyle changes suggested by doctors, measuring vital signs, etc.
2. Long-term states: Because TimeML was geared towards the task of temporal ordering, it strictly restricted annotation of stative events to the following types: (i) states associated with a temporal expression, (ii) states undergoing a change within the document, (iii) states introduced by other events, since those can offer temporal cues, and (iv) states associated with the document creation time. However, many stative events in our data don't fit within these strict parameters, but are nevertheless important. The most crucial category is states associated with long-term ongoing illnesses (e.g., "The patient has a long history of COPD").
These event categories are not specific to medical domains only. For example, long-term state events might be salient when extracting personal events from biographies. Similarly, activity patterns might be salient when extracting events from scientific procedure manuals. Considering these scenarios, we add rules to extract these two categories of events. We also expand syntactic rules to cover constructions unique to doctor-patient conversations such as repetition, especially for instructions, and hypothetical event annotation in utterances when doctors are "thinking out loud".
Table 2: Inter-annotator agreement on entity and event annotation tasks in both domains, measured using chance-corrected Cohen's κ.
Deviations from Thyme-TimeML: Our guidelines differ from Thyme-TimeML in their treatment of two categories:
1. Generic events: Thyme-TimeML annotates generic events in sections documenting discussion of risks, plans and alternative strategies. They do so because adding these events to a patient's clinical timeline could be important from a legal perspective, as they help to establish informed consent and knowledge of risk. We do not annotate pure generics, because we do not perceive any domain-agnostic utility in annotating them. Note that we annotate verbs of discussion and comprehension which are not generics, so we do not fully ignore events associated with patient consent. For example, in the sentence "She repeated the potential side effects back to me," repeated is annotated, but effects is not. Thyme-TimeML would annotate both.
2. Entities as events: Thyme-TimeML treats some entities and non-events as events in clinical language. Two categories see this shift in semantic interpretation: (i) Medications, and (ii) Disorders. Both categories contribute significant information to a patient's timeline, and so they are treated as events. Since we are not specifically focused on timeline construction, we do not follow the same reasoning. In particular, medications are not treated as events, while disorders may be treated as events as long as they fit the TimeML definition. To ensure that we do not discard potentially crucial information, we incorporate an additional step in which we annotate entities such as medications, body parts, abnormalities (e.g., rash), etc.

Annotation Process
After incorporating our modifications, we test our guidelines by having two expert annotators annotate one document from each domain. We see high inter-annotator agreement (measured by chance-corrected Cohen's κ) on entity and event annotation, in both domains. Table 2 presents the agreement scores. To create our final datasets, we sample 45 documents from each domain (15 from each specialty). Each document is annotated by one expert. Annotation is carried out using the BRAT stand-off markup interface (Stenetorp et al., 2012). Figure 1 shows a sample clinical note annotated with events and entities. Table 3 gives a brief overview of statistics for our datasets, in comparison with TimeBank (news articles) (Pustejovsky et al., 2003b).
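As an illustration of the agreement measure mentioned above, the following is a minimal sketch of Cohen's κ computed over token-level labels. The label names (`EVENT`, `O`) and the two toy annotation sequences are hypothetical and are not taken from our data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens that both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the annotators labeled independently,
    # estimated from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical token-level labels from two annotators for one short sentence.
annotator_1 = ["O", "O", "EVENT", "O", "EVENT", "O"]
annotator_2 = ["O", "O", "EVENT", "O", "O", "O"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on five of six tokens, but because `O` dominates both sequences, much of that agreement is expected by chance, so κ is considerably lower than the raw agreement rate.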

Methods for Marginal Alignment
To adapt event extraction models with no training data, we tackle the problem of covariate shift, which arises when the marginal distribution (or input distribution) P(X) changes between train and test data. Directly applying a supervised model trained on the training set to the test set might not perform well due to the gap between training and test distributions. We experiment with several techniques to align the training and test distributions, so that the supervised model transfers better to test data. The techniques can be divided into two types based on the kind of supervision used during alignment: (i) task-guided alignment techniques, and (ii) task-agnostic alignment techniques.

Task-Guided Alignment Techniques
These techniques jointly optimize for two tasks: (i) aligning training and test distributions, and (ii) training an event extraction model. Since the alignment process receives supervision from task training, we refer to these techniques as task-guided alignment techniques. Under this category, we experiment with adversarial domain adaptation.
Adversarial Domain Adaptation: Adversarial domain adaptation was proposed by Ganin and Lempitsky (2015), who showed its efficacy on sentiment analysis. Recently, Naik and Rosé (2020) showed its utility in transferring event extraction models between two domains: news and literature. The adversarial domain adaptation framework for event extraction contains three components: (i) a representation learner (R) which generates token-level representations for a sequence, (ii) an event classifier (E) which identifies events, and (iii) a domain predictor (D) which predicts the domain for the sequence.
The key idea is to train R to generate representations which are predictive for event identification but not predictive for domain prediction, making it more domain-invariant. This aligns training and test distributions by finding a shared feature space in which training and test samples are not distinguishable, while making sure that the feature space is useful for event extraction. The technique relies on an alternating optimization procedure. The first step optimizes D on the domain prediction task, while the second step optimizes both R and E on event identification while subtracting domain prediction loss. For complete mathematical details, we refer the interested reader to Naik and Rosé (2020).
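The alternating procedure described above can be written schematically as follows. The notation is an assumption introduced here for illustration: x is an input sequence, y its event labels (available for source data only), d its domain label, L_event and L_dom are the event and domain classification losses, θ denotes the parameters of each component, and λ is a hypothetical trade-off weight that this section does not specify. Naik and Rosé (2020) remains the reference for the exact formulation.

```latex
\begin{align*}
\text{Step 1:}\quad & \min_{\theta_D}\; \mathcal{L}_{\mathrm{dom}}\big(D(R(x)),\, d\big) \\
\text{Step 2:}\quad & \min_{\theta_R,\,\theta_E}\; \mathcal{L}_{\mathrm{event}}\big(E(R(x)),\, y\big) \;-\; \lambda\, \mathcal{L}_{\mathrm{dom}}\big(D(R(x)),\, d\big)
\end{align*}
```

Subtracting the domain loss in the second step rewards representations on which D performs poorly, which is what pushes R towards domain invariance.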

Task-Agnostic Alignment Techniques
These techniques perform training/test distribution alignment and event extraction training sequentially instead of jointly optimizing them. The alignment process does not receive supervision from task training, so these techniques are task-agnostic. We experiment with the following techniques:
Domain Adaptive Fine-tuning: Domain adaptive fine-tuning has been proposed as an effective technique for unsupervised adaptation of sequence labeling models to challenging domains such as Early Modern English and social media (Han and Eisenstein, 2019). This procedure works as follows:
1. Create a large dataset containing equal proportions of sentences from source and target domains. Fine-tune contextualized embeddings using a masked language modeling objective.
2. Using fine-tuned embeddings, train an event extraction model on labeled source data.
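The data preparation in step 1 can be sketched as below. This is an illustrative simplification and not our implementation: the function names are hypothetical, the 15% masking rate is an assumed default, and every selected token is replaced by `[MASK]`, whereas BERT's actual masking scheme also substitutes random tokens or keeps the original for some selected positions.

```python
import random

MASK = "[MASK]"

def build_balanced_corpus(source_sents, target_sents, rng):
    """Sample equal numbers of source and target sentences, then shuffle them together."""
    k = min(len(source_sents), len(target_sents))
    corpus = rng.sample(source_sents, k) + rng.sample(target_sents, k)
    rng.shuffle(corpus)
    return corpus

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a masked language modeling loss would be computed only on masked positions.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

# Toy tokenized sentences standing in for news (source) and medical (target) text.
rng = random.Random(13)
news = [["stocks", "fell", "sharply"], ["the", "senate", "voted"], ["rain", "is", "expected"]]
notes = [["patient", "denies", "fever"], ["lungs", "clear", "bilaterally"]]
corpus = build_balanced_corpus(news, notes, rng)
examples = [mask_tokens(sentence, rng) for sentence in corpus]
```

Sampling down to the smaller domain is one simple way to obtain equal proportions; with corpora of very different sizes, up-sampling the smaller one would be an alternative.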
In addition to this setup, we experiment with a variant of this procedure, which uses a syntactic objective function. This variant fine-tunes embeddings on the POS tagging task in step 1. The motivation behind this variant is two-fold. First, we observe that event annotation is heavily syntax-driven, allowing delexicalized models (i.e., models using POS tags instead of words) to achieve high performance (§5.2). This indicates that infusing more syntactic awareness into embeddings might help performance on the task. Second, syntax might offer an additional basis for generalization, since sentences that look very different lexically might follow similar syntactic structures. Intuitively, this variant is similar to syntactic relexicalization, which has shown success in cross-lingual dependency parsing (Duong et al., 2015).
Likelihood-based Instance Weighting: We develop a new instance weighting procedure which uses likelihood scores computed by a language model. Instance selection and instance weighting strategies have frequently been used to perform domain adaptation by correcting for distributional differences (Jiang and Zhai, 2007; Foster et al., 2010; Axelrod et al., 2011; Wang et al., 2017). The main premise is that some samples from out-of-domain data and in-domain data often share some characteristics. Training only on these samples (pruning), or biasing training to focus more on these samples (weighting) can produce models that perform better on out-of-domain data. Motivated by this, our instance weighting procedure works as follows. Let S_t = w_1 w_2 ... w_n be a sentence from the in-domain training set. Let O be a language model trained on raw text from the target domain. We first compute the likelihood of sentence S_t under O. Then we compute a weight (alpha) for S_t from this likelihood and |N|, the size of the in-domain training set.
This metric gives a higher weight to in-domain sentences that are more likely under the target domain language model, up-weighting instances that share more characteristics with target domain sentences.
The alpha values are used to weight the loss function, thus biasing the training procedure.
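A minimal sketch of the weighting idea follows. It rests on several assumptions that differ from, or are not specified by, the description above: a toy add-one-smoothed unigram model stands in for the target-domain language model O, sentence scores are length-normalized log-likelihoods, and weights are a softmax over these scores rescaled to average 1 across the training set. It illustrates likelihood-based weighting in general and is not the exact formula used for LIW.

```python
import math
from collections import Counter

def train_unigram_lm(target_sentences):
    """Add-one-smoothed unigram LM: a toy stand-in for the target-domain LM O."""
    counts = Counter(tok for sent in target_sentences for tok in sent)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def log_prob(token):
        return math.log((counts[token] + 1) / (total + vocab_size))

    return log_prob

def sentence_score(sentence, log_prob):
    """Length-normalized log-likelihood of a sentence under the LM."""
    return sum(log_prob(tok) for tok in sentence) / len(sentence)

def instance_weights(source_sentences, log_prob):
    """Softmax over sentence scores, rescaled so that the weights average to 1."""
    scores = [sentence_score(s, log_prob) for s in source_sentences]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    n = len(source_sentences)
    return [n * e / z for e in exps]

def weighted_loss(per_sentence_losses, weights):
    """Average of per-sentence losses, each scaled by its instance weight."""
    return sum(w * l for w, l in zip(weights, per_sentence_losses)) / len(weights)

# Toy data: target (medical) text trains the LM; source (news) sentences get weights.
target = [["patient", "has", "chest", "pain"], ["patient", "denies", "fever"]]
source = [["the", "patient", "has", "pain"], ["stocks", "fell", "sharply", "today"]]
lm = train_unigram_lm(target)
weights = instance_weights(source, lm)
```

In this toy example the first source sentence shares vocabulary with the target text and receives the larger weight, so its loss contributes more during training.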

Model Details
The goal of our evaluation is to identify which alignment technique works best for each domain, as well as analyze whether there are specific kinds of source-target shifts that some techniques are better equipped to handle. We choose a strong BERT-based baseline model with no transfer, and evaluate the performance of each alignment technique when applied to this baseline. We compare the following models:
VERB: Baseline labeling all verbs as events.
DELEX: Fully-delexicalized baseline using POS tag embeddings as features, followed by an MLP.
BERT: Single-layer BiLSTM over BERT embeddings (Devlin et al., 2019), followed by an MLP, similar to the best-performing model on LitBank (Sims et al., 2019).
C-BERT: Similar to BERT, but embeddings are extracted from Clinical-BERT (Alsentzer et al., 2019).
BERT-ADA: BERT trained using adversarial domain adaptation.
BERT-LIW: BERT trained on data weighted by LM likelihood. We train autoregressive language models over 3 million tokens for each domain.
BERT-DAFT: BERT with domain adaptive fine-tuning. For target domains, we use the same text as BERT-LIW, and extract 3 million tokens from CNN/DailyMail (Hermann et al., 2015) for news.
BERT-DAFT-SYN: BERT with syntactic fine-tuning on the same text as BERT-DAFT, tagged using Stanford CoreNLP.
Complete implementation details are provided in appendix A.
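To make the simplest baseline and the evaluation metric concrete, the sketch below implements a VERB-style tagger and token-level precision, recall and F1. It assumes Penn Treebank POS tags, and the tags in the example are assigned by hand for illustration; the function names are hypothetical.

```python
def verb_baseline(pos_tags):
    """Label a token as an event (1) iff its Penn Treebank POS tag starts with 'VB'."""
    return [1 if tag.startswith("VB") else 0 for tag in pos_tags]

def precision_recall_f1(gold, pred):
    """Token-level precision, recall and F1 for the positive (event) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# "She was diagnosed with cancer": diagnosed and cancer are the triggers.
tokens = ["She", "was", "diagnosed", "with", "cancer"]
pos_tags = ["PRP", "VBD", "VBN", "IN", "NN"]  # hand-assigned tags (assumption)
gold = [0, 0, 1, 0, 1]
pred = verb_baseline(pos_tags)
precision, recall, f1 = precision_recall_f1(gold, pred)
```

On this sentence the baseline wrongly labels the auxiliary was and misses the nominal trigger cancer, which illustrates why a verb-only rule both over-generates and under-generates.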

Results
Tables 4 and 5 show the performance of all models when transferring from news data to clinical notes and doctor-patient conversations respectively. From the tables, we see that DELEX is surprisingly strong out-of-domain. BERT with no transfer performs well out-of-domain, improving by 8.25 F1 points on average over DELEX. C-BERT also performs well out-of-domain, but does worse than BERT. We attribute this to the fact that fine-tuning only on clinical notes does not improve alignment with the source domain (news), providing no basis for models trained on news to adapt better. BERT-ADA shows mixed results, improving over BERT by 2.4 F1 on notes, but dropping by 1.1 F1 on conversations. BERT-LIW and BERT-DAFT improve upon BERT in both domains. BERT-DAFT shows minor performance drops in-domain, due to some degree of catastrophic forgetting. BERT-DAFT-SYN shows performance drops, both in-domain and out-of-domain, in both settings. Unlike syntactic relexicalization work which used non-contextualized embeddings, we use contextualized embeddings, which possess a larger degree of syntactic information, probably reducing the need for syntax-driven training. Another source of errors is POS tagging, since off-the-shelf taggers trained on news will be less accurate on our data. Across domains, the skew between precision and recall is higher on notes, which might stem from the specialized vocabulary dragging down recall.

Analysis and Discussion
Tables 4 and 5 provide an indication of model ability to handle covariate shift. However, covariate shift occurs at multiple layers in language (e.g., lexical level, syntactic level, etc.), leading to different dimensions of variation between domains (e.g., topical variation, genre variation, etc.). Looking at overall model performance does not offer insight into whether there are specific shifts that some models are better at addressing. We dig deeper into this question, focusing on two levels of shift: (i) lexical shift, and (ii) semantic (event type) shift.
Variation under lexical shift: We separate model performance on in-vocabulary (IV) and out-of-vocabulary (OOV) tokens. Note that the proportion of events that are OOV is higher in clinical notes (52%) than conversations (20.6%). Tables 6 and 7 present model performance on these token categories. Surprisingly, despite the use of specialized language, OOV performance on clinical notes is higher than conversations for all models except BERT-DAFT. Taking a closer look at the OOV event instances from clinical notes that models identify correctly, we see that a large proportion (54.8%) contain one of three morphological patterns: (i) past tense verbs ending in "-ed", (ii) gerunds ending in "-ing", or (iii) nouns ending in "-tion" or "-sion". These patterns are also common among events in the news domain. For example, past tense verbs often refer to events that have already occurred and gerunds and nouns ending in "-tion" refer to processes. We hypothesize that BERT-based models might be exploiting these morphological regularities to correctly label unseen medical terms (e.g., irrigated, excision, dissected, wheezing, etc.). These patterns are more prevalent in notes (35.6%) than conversations (23.5%), explaining the surprising performance difference.
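The IV/OOV split and the suffix check can be approximated as in the sketch below. It is a surface-level simplification under stated assumptions: the vocabulary and token lists are toy examples, the function names are hypothetical, and the check matches suffixes only, without verifying part of speech (so a non-verb ending in "-ed" would also match).

```python
SUFFIX_PATTERNS = ("ed", "ing", "tion", "sion")

def split_iv_oov(tokens, source_vocab):
    """Partition tokens by whether their lowercased form occurs in the source vocabulary."""
    iv = [t for t in tokens if t.lower() in source_vocab]
    oov = [t for t in tokens if t.lower() not in source_vocab]
    return iv, oov

def matches_pattern(token):
    """True if the token ends in one of the suffixes of interest (surface check only)."""
    return token.lower().endswith(SUFFIX_PATTERNS)

def pattern_share(tokens):
    """Fraction of tokens that match one of the suffix patterns."""
    if not tokens:
        return 0.0
    return sum(matches_pattern(t) for t in tokens) / len(tokens)

# Toy source vocabulary and target-domain event tokens (illustrative only).
source_vocab = {"reported", "said", "meeting"}
event_tokens = ["irrigated", "excision", "dissected", "wheezing", "reported", "COPD"]
iv_tokens, oov_tokens = split_iv_oov(event_tokens, source_vocab)
share = pattern_share(oov_tokens)
```

In this toy example four of the five OOV event tokens carry one of the suffixes, while an abbreviation such as COPD offers no such morphological cue.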
Variation under semantic shift: To determine whether model performance on OOV tokens depends on event type, we randomly sample ∼500 OOV tokens from each domain and label them for event type. We use the same typology as TimeML (State, I-State, Occurrence, Aspectual, Reporting, Perception, I-Action, None), with additional labels for the event types we introduce (ActivityPattern, LongTermState); examples are provided in appendix B. We run an ANOVA model with each token per model as an instance (total 5080 instances), noting Event Type, Target (notes/convos), Model (BERT/ADA/LIW/DAFT/DAFT-SYN) and Correctness (1 vs 0). Correctness is the dependent variable, while the others are independent variables. We include all pairwise interaction terms and the three-way interaction between Event Type, Target and Model. We see a significant main effect of Event Type on Correctness (p < 0.0001), indicating that some event types are more difficult. There are two significant two-way interactions, one between Target and Event Type (p < 0.0001), indicating that difficulty of event types differs across target domains, and one between Model and Event Type (p < 0.0001), indicating that which model is better depends on event type. The three-way interaction between Model, Event Type, and Target is also significant (p < 0.0001), indicating that performance differences between models per event type differ between target domains. We interpret differences in performance per event type separately for each target domain using a Student-t post-hoc analysis to determine which pairwise contrasts are statistically significant. This reveals that in clinical notes, LIW outperforms all models on I-State events (i.e., hypothetical, future or negated states) and LongTermState events, a category never seen in the training data. These improvements might stem from the training algorithm used by LIW. LIW up-weights instances in news that resemble medical data, which contains a high proportion of these event categories.
Therefore, despite being infrequent in news, they get up-weighted, helping LIW identify them better.

Conclusion
In this work, we focused on unsupervised adaptation of event extractors to new domains by aligning the marginal distributions of source and target domains. We created two event extraction test sets using English texts from two medical domains: (i) clinical notes, and (ii) doctor-patient conversations, and tested the efficacy of three alignment techniques: (i) adversarial domain adaptation (ADA), (ii) domain adaptive fine-tuning (DAFT), and (iii) a new instance weighting technique based on language model likelihood scores (LIW). None of these models consistently outperformed the others, but a deeper analysis of model performance under different types of shifts (e.g., lexical shift, semantic shift) uncovered interesting variations among models. Our best-performing models attained F1 scores of 70.0 and 72.9 on notes and conversations respectively, using no labeled target data. We believe these models define a good starting point and can be further improved using few-shot learning.