Caregiver strategy classification in pediatric rehabilitation contexts is strongly motivated by real-world clinical constraints but highly under-resourced and seldom studied in natural language processing settings. We introduce a large dataset of 4,037 caregiver strategies in this setting, a five-fold increase over the nearest contemporary dataset. These strategies are manually categorized into clinically established constructs with high agreement (𝜅=0.68-0.89). We also propose two techniques to further address identified data constraints. First, we manually supplement target task data with publicly relevant data from online child health forums. Next, we propose a novel data augmentation technique to generate synthetic caregiver strategies with high downstream task utility. Extensive experiments showcase the quality of our dataset. They also establish evidence that both the publicly available data and the synthetic strategies result in large performance gains, with relative F1 increases of 22.6% and 50.9%, respectively.
Natural language processing (NLP) is a fast-paced field and a popular course topic in many undergraduate and graduate programs. This paper presents a comprehensive suite of example-driven course slides covering NLP concepts, ranging from fundamental building blocks to modern state-of-the-art approaches. In contributing these slides, I hope to alleviate burden for those starting out as faculty or in need of course material updates. The slides are publicly available for external use and are updated regularly to incorporate new advancements.
Spoken language presents a compelling medium for non-invasive Alzheimer’s disease (AD) screening, and prior work has examined the use of fine-tuned pretrained language models (PLMs) for this purpose. However, PLMs are often optimized on tasks that are inconsistent with AD classification. Spoken language corpora for AD detection are also small and disparate, making generalizability difficult. This paper investigates the use of domain-adaptive prompt fine-tuning for AD detection, using AD classification loss as the training objective and leveraging spoken language corpora from a variety of language tasks. Extensive experiments using voting-based combinations of different prompting paradigms show an impressive mean detection F1=0.8952 (with std=0.01 and best F1=0.9130) for the highest-performing approach when using BERT as the base PLM.
Human evaluations are indispensable in the development of NLP systems because they provide direct insights into how effectively these systems meet real-world needs and expectations. Ensuring the reproducibility of these evaluations is vital for maintaining credibility in natural language processing research. This paper presents our reproduction of the human evaluation experiments conducted by Hosking et al. (2022) for their paraphrase generation approach. Through careful replication we found that our results closely align with those in the original study, indicating a high degree of reproducibility.
Recent research highlights the importance of figurative language as a tool for amplifying emotional impact. In this paper, we dive deeper into this phenomenon and outline our methods for Track 1, Empathy Prediction in Conversations (CONV-dialog) and Track 2, Empathy and Emotion Prediction in Conversation Turns (CONV-turn) of the WASSA 2024 shared task. We leveraged transformer-based large language models augmented with figurative language prompts, specifically idioms, metaphors and hyperbole, that were selected and trained for each track to optimize system performance. For Track 1, we observed that a fine-tuned BERT with metaphor and hyperbole features outperformed other models on the development set. For Track 2, DeBERTa, with different combinations of figurative language prompts, performed well for different prediction tasks. Our method provides a novel framework for understanding how figurative language influences emotional perception in conversational contexts. Our system officially ranked 4th in the 1st track and 3rd in the 2nd track.
Empathy is a social mechanism used to support and strengthen emotional connection with others, including in online communities. However, little is currently known about the nature of these online expressions, nor the particular factors that may lead to their improved detection. In this work, we study the role of a specific and complex subcategory of linguistic phenomena, figurative language, in online expressions of empathy. Our extensive experiments reveal that incorporating features regarding the use of metaphor, idiom, and hyperbole into empathy detection models improves their performance, resulting in impressive maximum F1 scores of 0.942 and 0.809 for identifying posts without and with empathy, respectively.
Empathy is critical for effective communication and mental health support, and in many online health communities people anonymously engage in conversations to seek and provide empathetic support. The ability to automatically recognize and detect empathy contributes to the understanding of human emotions expressed in text, therefore advancing natural language understanding across various domains. Existing empathy and mental health-related corpora focus on broader contexts and lack domain specificity, but similarly to other tasks (e.g., learning distinct patterns associated with COVID-19 versus skin allergies in clinical notes), observing empathy within different domains is crucial to providing tailored support. To address this need, we introduce AcnEmpathize, a dataset that captures empathy expressed in acne-related discussions from forum posts focused on its emotional and psychological effects. We find that transformer-based models trained on our dataset demonstrate excellent performance at empathy classification. Our dataset is publicly released to facilitate analysis of domain-specific empathy in online conversations and advance research in this challenging and intriguing domain.
In pediatric rehabilitation services, one intervention approach involves using solution-focused caregiver strategies to support children in their daily life activities. The manual sharing of these strategies is not scalable, warranting need for an automated approach to recognize and select relevant strategies. We introduce CareCorpus, a dataset of 780 real-world strategies written by caregivers. Strategies underwent dual-annotation by three trained annotators according to four established rehabilitation classes (i.e., environment/context, n=325 strategies; a child’s sense of self, n=151 strategies; a child’s preferences, n=104 strategies; and a child’s activity competences, n=62 strategies) and a no-strategy class (n=138 instances) for irrelevant or indeterminate instances. The average percent agreement was 80.18%, with a Cohen’s Kappa of 0.75 across all classes. To validate this dataset, we propose multi-grained classification tasks for detecting and categorizing strategies, and establish new performance benchmarks ranging from F1=0.53-0.79. Our results provide a first step towards a smart option to sort caregiver strategies for use in designing pediatric rehabilitation care plans. This novel, interdisciplinary resource and application is also anticipated to generalize to other pediatric rehabilitation service contexts that target children with developmental need.
Naively assuming English as a source language may hinder cross-lingual transfer for many languages by failing to consider the importance of language contact. Some languages are more well-connected than others, and target languages can benefit from transferring from closely related languages; for many languages, the set of closely related languages does not include English. In this work, we study the impact of source language for cross-lingual transfer, demonstrating the importance of selecting source languages that have high contact with the target language. We also construct a novel benchmark dataset for close contact Chinese-Japanese-Korean-Vietnamese (CJKV) languages to further encourage in-depth studies of language contact. To comprehensively capture contact between these languages, we propose to integrate Romanized transcription beyond textual scripts via Contrastive Learning objectives, leading to enhanced cross-lingual representations and effective zero-shot cross-lingual transfer.
We introduce the Humanistic Buddhism Corpus (HBC), a dataset containing over 80,000 Chinese-English parallel phrases extracted and translated from publications in the domain of Buddhism. HBC is one of the largest free domain-specific datasets that is publicly available for research, containing text from both classical and modern Chinese. Moreover, since HBC originates from religious texts, many phrases in the dataset contain metaphors and symbolism, and are subject to multiple interpretations. Compared to existing machine translation datasets, HBC presents difficult unique challenges. In this paper, we describe HBC in detail. We evaluate HBC within a machine translation setting, validating its use by establishing performance benchmarks using a Transformer model with different transfer learning setups.
Identifying early markers of Alzheimer’s disease (AD) trajectory enables intervention in early disease stages when our currently-available interventions are most likely to be beneficial. Research has shown that alterations in speech, as well as linguistic and semantic deviations in spontaneous conversation detected using natural language processing, manifest early in AD prior to some other observed cognitive deficits. Recent studies show that cerebrospinal fluid (CSF) levels serve as useful early biomarkers for identifying early AD, but CSF biomarkers are challenging to collect. A simpler alternative that has seen very rapid development is based on the use of plasma biomarkers as a blood draw is minimally invasive. Associating verbal and nonverbal characteristics from speech data with CSF and plasma biomarkers may open the door to less invasive, more efficient methods for early AD detection. We present SLaCAD, a new dataset to facilitate this process. We describe our data collection procedures, analyze the resulting corpus, and present preliminary findings that relate measures extracted from the audio and transcribed text to clinical diagnoses, CSF levels, and plasma biomarkers. Our findings demonstrate the feasibility of this and indicate that the collected data can be used to improve assessments of early AD.
Contemporary NLP has rapidly progressed from feature-based classification to fine-tuning and prompt-based techniques leveraging large language models. Many of these techniques remain understudied in the context of real-world, clinically enriched spontaneous dialogue. We fill this gap by systematically testing the efficacy and overall performance of a wide variety of NLP techniques ranging from feature-based to in-context learning on transcribed speech collected from patients with bipolar disorder, schizophrenia, and healthy controls taking a focused, clinically-validated language test. We observe impressive utility of a range of feature-based and language modeling techniques, finding that these approaches may provide a plethora of information capable of upholding clinical truths about these subjects. Building upon this, we establish pathways for future research directions in automated detection and understanding of psychiatric conditions.
Disentangling underlying factors contributing to the expression of emotion in multimodal data is challenging but may accelerate progress toward many real-world applications. In this paper we describe our approach for solving SemEval-2024 Task #3, Sub-Task #1, focused on identifying utterance-level emotions and their causes using the text available from the multimodal F.R.I.E.N.D.S. television series dataset. We propose to disjointly model emotion detection and causal span detection, borrowing a paradigm popular in question answering (QA) to train our model. Through our experiments we find that (a) contextual utterances before and after the target utterance play a crucial role in emotion classification; and (b) once the emotion is established, detecting the causal spans resulting in that emotion using our QA-based technique yields promising results.
Health-related speech datasets are often small and varied in focus. This makes it difficult to leverage them to effectively support healthcare goals. Robust transfer of linguistic features across different datasets orbiting the same goal carries potential to address this concern. To test this hypothesis, we experiment with domain adaptation (DA) techniques on heterogeneous spoken language data to evaluate generalizability across diverse datasets for a common task: dementia detection. We find that adapted models exhibit better performance across conversational and task-oriented datasets. The feature-augmented DA method achieves a 22% increase in accuracy adapting from a conversational to task-specific dataset compared to a jointly trained baseline. This suggests promising capacity of these techniques to allow for productive use of disparate data for a complex spoken language healthcare task.
It might reasonably be expected that running multiple experiments for the same task using the same data and model would yield very similar results. Recent research has, however, shown this not to be the case for many NLP experiments. In this paper, we report extensive coordinated work by two NLP groups to run the training and testing pipeline for three neural text simplification models under varying experimental conditions, including different random seeds, run-time environments, and dependency versions, yielding a large number of results for each of the three models using the same data and train/dev/test set splits. From one perspective, these results can be interpreted as shedding light on the reproducibility of evaluation results for the three NTS models, and we present an in-depth analysis of the variation observed for different combinations of experimental conditions. From another perspective, the results raise the question of whether the averaged score should be considered the ‘true’ result for each model.
Recognizing medical self-disclosure is important in many healthcare contexts, but it has been under-explored by the NLP community. We conduct a three-pronged investigation of this task. We (1) manually expand and refine the only existing medical self-disclosure corpus, resulting in a new, publicly available dataset of 3,919 social media posts with clinically validated labels and high compatibility with the existing task-specific protocol. We also (2) study the merits of pretraining task domain and text style by comparing Transformer-based models for this task, pretrained from general, medical, and social media sources. Our BERTweet condition outperforms the existing state of the art for this task by a relative F1 score increase of 16.73%. Finally, we (3) compare data augmentation techniques for this task, to assess the extent to which medical self-disclosure data may be further synthetically expanded. We discover that this task poses many challenges for data augmentation techniques, and we provide an in-depth analysis of identified trends.
This paper presents a partial reproduction study of Data-to-text Generation with Macro Planning by Puduppully et al. (2021). This work was conducted as part of the ReproHum project, a multi-lab effort to reproduce the results of NLP papers incorporating human evaluations. We follow the same instructions provided by the authors and the ReproHum team to the best of our abilities. We collect preference ratings for the following evaluation criteria in order: conciseness, coherence, and grammaticality. Our results are highly correlated with the original experiment. Nonetheless, we believe the presented results are insufficent to conclude that the Macro system proposed and developed by the original paper is superior compared to other systems. We suspect combining our results with the three other reproductions of this paper through the ReproHum project will paint a clearer picture. Overall, we hope that our work is a step towards a more transparent and reproducible research landscape.
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction was discovered to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding that the great majority of human evaluations in NLP is not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
Evidence has demonstrated the presence of similarities in language use across people with various mental health conditions. In this work, we investigate these correlations both in terms of literature and as a data analysis problem. We also introduce a novel state-of-the-art transfer learning-based approach that learns from linguistic feature spaces of previous conditions and predicts unknown ones. Our model achieves strong performance, with F1 scores of 0.75, 0.80, and 0.76 at detecting depression, stress, and suicidal ideation in a first-of-its-kind transfer task and offering promising evidence that language models can harness learned patterns from known mental health conditions to aid in their prediction of others that may lie latent.
The reproducibility of NLP research has drawn increased attention over the last few years. Several tools, guidelines, and metrics have been introduced to address concerns in regard to this problem; however, much work still remains to ensure widespread adoption of effective reproducibility standards. In this work, we review the reproducibility of Exploring Neural Text Simplification Models by Nisioi et al. (2017), evaluating it from three main aspects: data, software artifacts, and automatic evaluations. We discuss the challenges and issues we faced during this process. Furthermore, we explore the adequacy of current reproducibility standards. Our code, trained models, and a docker container of the environment used for training and evaluation are made publicly available.
Automatic speech recognition (ASR) systems usually incorporate postprocessing mechanisms to remove disfluencies, facilitating the generation of clear, fluent transcripts that are conducive to many downstream NLP tasks. However, verbal disfluencies have proved to be predictive of dementia status, although little is known about how various types of verbal disfluencies, nor automatically detected disfluencies, affect predictive performance. We experiment with an off-the-shelf disfluency annotator to tag disfluencies in speech transcripts for a well-known cognitive health assessment task. We evaluate the performance of this model on detecting repetitions and corrections or retracing, and measure the influence of gold annotated versus automatically detected verbal disfluencies on dementia detection through a series of experiments. We find that removing both gold and automatically-detected disfluencies negatively impacts dementia detection performance, degrading classification accuracy by 5.6% and 3% respectively
Task-oriented dialogue systems are increasingly prevalent in healthcare settings, and have been characterized by a diverse range of architectures and objectives. Although these systems have been surveyed in the medical community from a non-technical perspective, a systematic review from a rigorous computational perspective has to date remained noticeably absent. As a result, many important implementation details of healthcare-oriented dialogue systems remain limited or underspecified, slowing the pace of innovation in this area. To fill this gap, we investigated an initial pool of 4070 papers from well-known computer science, natural language processing, and artificial intelligence venues, identifying 70 papers discussing the system-level implementation of task-oriented dialogue systems for healthcare applications. We conducted a comprehensive technical review of these papers, and present our key findings including identified gaps and corresponding recommendations.
Dementia often manifests in dialog through specific behaviors such as requesting clarification, communicating repetitive ideas, and stalling, prompting conversational partners to probe or otherwise attempt to elicit information. Dialog act (DA) sequences can have predictive power for dementia detection through their potential to capture these meaningful interaction patterns. However, most existing work in this space relies on content-dependent features, raising questions about their generalizability beyond small reference sets or across different cognitive tasks. In this paper, we adapt an existing DA annotation scheme for two different cognitive tasks present in a popular dementia detection dataset. We show that a DA tagging model leveraging neural sentence embeddings and other information from previous utterances and speaker tags achieves strong performance for both tasks. We also propose content-free interaction features and show that they yield high utility in distinguishing dementia and control subjects across different tasks. Our study provides a step toward better understanding how interaction patterns in spontaneous dialog affect cognitive modeling across different tasks, which carries implications for the design of non-invasive and low-cost cognitive health monitoring tools for use at scale.
The availability of source code has been put forward as one of the most critical factors for improving the reproducibility of scientific research. This work studies trends in source code availability at major computational linguistics conferences, namely, ACL, EMNLP, LREC, NAACL, and COLING. We observe positive trends, especially in conferences that actively promote reproducibility. We follow this by conducting a reproducibility study of eight papers published in EMNLP 2021, finding that source code releases leave much to be desired. Moving forward, we suggest all conferences require self-contained artifacts and provide a venue to evaluate such artifacts at the time of publication. Authors can include small-scale experiments and explicit scripts to generate each result to improve the reproducibility of their work.
NLP offers a myriad of opportunities to support mental health research. However, prior work has almost exclusively focused on social media data, for which diagnoses are difficult or impossible to validate. We present a first-of-its-kind dataset of manually transcribed interactions with people clinically diagnosed with bipolar disorder and schizophrenia, as well as healthy controls. Data was collected through validated clinical tasks and paired with diagnostic measures. We extract 100+ temporal, sentiment, psycholinguistic, emotion, and lexical features from the data and establish classification validity using a variety of models to study language differences between diagnostic groups. Our models achieve strong classification performance (maximum F1=0.93-0.96), and lead to the discovery of interesting associations between linguistic features and diagnostic class. It is our hope that this dataset will offer high value to clinical and NLP researchers, with potential for widespread broader impacts.
Deploying recent natural language processing innovations to low-resource settings allows for state-of-the-art research findings and applications to be accessed across cultural and linguistic borders. One low-resource setting of increasing interest is code-switching, the phenomenon of combining, swapping, or alternating the use of two or more languages in continuous dialogue. In this paper, we introduce a large dataset (20k+ instances) to facilitate investigation of Tagalog-English code-switching, which has become a popular mode of discourse in Philippine culture. Tagalog is an Austronesian language and former official language of the Philippines spoken by over 23 million people worldwide, but it and Tagalog-English are under-represented in NLP research and practice. We describe our methods for data collection, as well as our labeling procedures. We analyze our resulting dataset, and finally conclude by providing results from a proof-of-concept regression task to establish dataset validity, achieving a strong performance benchmark (R2=0.797-0.909; RMSE=0.068-0.057).
The COVID-19 pandemic and other global health events are unfortunately excellent environments for the creation and spread of misinformation, and the language associated with health misinformation may be typified by unique patterns and linguistic markers. Allowing health misinformation to spread unchecked can have devastating ripple effects; however, detecting and stopping its spread requires careful analysis of these linguistic characteristics at scale. We analyze prior investigations focusing on health misinformation, associated datasets, and detection of misinformation during health crises. We also introduce a novel dataset designed for analyzing such phenomena, comprised of 2.8 million news articles and social media posts spanning the early 1900s to the present. Our annotation guidelines result in strong agreement between independent annotators. We describe our methods for collecting this data and follow this with a thorough analysis of the themes and linguistic features that appear in information versus misinformation. Finally, we demonstrate a proof-of-concept misinformation detection task to establish dataset validity, achieving a strong performance benchmark (accuracy = 75%; F1 = 0.7).
Misogynistic memes are rampant on social media, and often convey their messages using multimodal signals (e.g., images paired with derogatory text or captions). However, to date very few multimodal systems have been leveraged for the detection of misogynistic memes. Recently, researchers have turned to contrastive learning, and most notably OpenAI’s CLIP model, is an innovative solution to a variety of multimodal tasks. In this work, we experiment with contrastive learning to address the detection of misogynistic memes within the context of SemEval 2022 Task 5. Although our model does not achieve top results, these experiments provide important exploratory findings for this task. We conduct a detailed error analysis, revealing promising clues and offering a foundation for follow-up work.
Euphemisms are often used to drive rhetoric, but their automated recognition and interpretation are under-explored. We investigate four methods for detecting euphemisms in sentences containing potentially euphemistic terms. The first three linguistically-motivated methods rest on an understanding of (1) euphemism’s role to attenuate the harsh connotations of a taboo topic and (2) euphemism’s metaphorical underpinnings. In contrast, the fourth method follows recent innovations in other tasks and employs transfer learning from a general-domain pre-trained language model. While the latter method ultimately (and perhaps surprisingly) performed best (F1 = 0.74), we comprehensively evaluate all four methods to derive additional useful insights from the negative results.
The spread of fake news can have devastating ramifications, and recent advancements to neural fake news generators have made it challenging to understand how misinformation generated by these models may best be confronted. We conduct a feature-based study to gain an interpretative understanding of the linguistic attributes that neural fake news generators may most successfully exploit. When comparing models trained on subsets of our features and confronting the models with increasingly advanced neural fake news, we find that stylistic features may be the most robust. We discuss our findings, subsequent analyses, and broader implications in the pages within.
Self-disclosure in online health conversations may offer a host of benefits, including earlier detection and treatment of medical issues that may have otherwise gone unaddressed. However, research analyzing medical self-disclosure in online communities is limited. We address this shortcoming by introducing a new dataset of health-related posts collected from online social platforms, categorized into three groups (No Self-Disclosure, Possible Self-Disclosure, and Clear Self-Disclosure) with high inter-annotator agreement (_k_=0.88). We make this data available to the research community. We also release a predictive model trained on this dataset that achieves an accuracy of 81.02%, establishing a strong performance benchmark for this task.
In this work we describe and analyze a supervised learning system for word emphasis selection in phrases drawn from visual media as a part of the Semeval 2020 Shared Task 10. More specifically, we begin by briefly introducing the shared task problem and provide an analysis of interesting and relevant features present in the training dataset. We then introduce our LSTM-based model and describe its structure, input features, and limitations. Our model ultimately failed to beat the benchmark score, achieving an average match() score of 0.704 on the validation data (0.659 on the test data) but predicted 84.8% of words correctly considering a 0.5 threshold. We conclude with a thorough analysis and discussion of erroneous predictions with many examples and visualizations.
Automating straightforward clinical tasks can reduce workload for healthcare professionals, increase accessibility for geographically-isolated patients, and alleviate some of the economic burdens associated with healthcare. A variety of preliminary screening procedures are potentially suitable for automation, and one such domain that has remained underexplored to date is that of structured clinical interviews. A task-specific dialogue agent is needed to automate the collection of conversational speech for further (either manual or automated) analysis, and to build such an agent, a dialogue manager must be trained to respond to patient utterances in a manner similar to a human interviewer. To facilitate the development of such an agent, we propose an annotation schema for assigning dialogue act labels to utterances in patient-interviewer conversations collected as part of a clinically-validated cognitive health screening task. We build a labeled corpus using the schema, and show that it is characterized by high inter-annotator agreement. We establish a benchmark dialogue act classification model for the corpus, thereby providing a proof of concept for the proposed annotation schema. The resulting dialogue act corpus is the first such corpus specifically designed to facilitate automated cognitive health screening, and lays the groundwork for future exploration in this area.
Alzheimers disease is an irreversible brain disease that slowly destroys memory skills andthinking skills leading to the need for full-time care. Early detection of Alzheimer’s dis-ease is fundamental to slow down the progress of the disease. In this work we are developing Natural Language Processing techniques to detect linguistic characteristics of patients suffering Alzheimer’s Disease and related Dementias. We are proposing a neural model based on a CNN-LSTM architecture that is able to take in consideration both long language samples and hand-crafted linguistic features to distinguish between dementia affected and healthy patients. We are exploring the effects of the introduction of an attention mechanism on both our model and the actual state of the art. Our approach is able to set a new state-of-the art on the DementiaBank dataset achieving an F1 Score of 0.929 in the Dementia patients classification Supplementary material include code to run the experiments.
Visual storytelling is an intriguing and complex task that only recently entered the research arena. In this work, we survey relevant work to date, and conduct a thorough error analysis of three very recent approaches to visual storytelling. We categorize and provide examples of common types of errors, and identify key shortcomings in current work. Finally, we make recommendations for addressing these limitations in the future.
Detecting sarcasm in text is a particularly challenging problem in computational semantics, and its solution may vary across different types of text. We analyze the performance of a domain-general sarcasm detection system on datasets from two very different domains: Twitter, and Amazon product reviews. We categorize the errors that we identify with each, and make recommendations for addressing these issues in NLP systems in the future.
The automatic generation of stimulating questions is crucial to the development of intelligent cognitive exercise applications. We developed an approach that generates appropriate Questioning the Author queries based on novel metaphors in diverse syntactic relations in literature. We show that the generated questions are comparable to human-generated questions in terms of naturalness, sensibility, and depth, and score slightly higher than human-generated questions in terms of clarity. We also show that questions generated about novel metaphors are rated as cognitively deeper than questions generated about non- or conventional metaphors, providing evidence that metaphor novelty can be leveraged to promote cognitive exercise.
Crowdsourcing offers a convenient means of obtaining labeled data quickly and inexpensively. However, crowdsourced labels are often noisier than expert-annotated data, making it difficult to aggregate them meaningfully. We present an aggregation approach that learns a regression model from crowdsourced annotations to predict aggregated labels for instances that have no expert adjudications. The predicted labels achieve a correlation of 0.594 with expert labels on our data, outperforming the best alternative aggregation method by 11.9%. Our approach also outperforms the alternatives on third-party datasets.