Mark Dredze

2025

Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats
Kuleen Sasse | Carlos Alejandro Aguirre | Isabel Cachola | Sharon Levy | Mark Dredze
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Dog whistles are coded expressions with dual meanings: one intended for the general public (outgroup) and another that conveys a specific message to an intended audience (ingroup). Often, these expressions are used to convey controversial political opinions while maintaining plausible deniability and slipping by content moderation filters. Identification of dog whistles relies on curated lexicons, which have trouble keeping up to date. We introduce FETCH!, a task for finding novel dog whistles in massive social media corpora. We find that state-of-the-art systems fail to achieve meaningful results across three distinct social media case studies. We present EarShot, a strong baseline system that combines the strengths of vector databases and Large Language Models (LLMs) to efficiently and effectively identify new dog whistles.

DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation
Miriam Wanner | Benjamin Van Durme | Mark Dredze
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The decompose-then-verify strategy for verification of Large Language Model (LLM) generations decomposes claims that are then independently verified. Decontextualization augments text (claims) to ensure it can be verified outside of the original context, enabling reliable verification. While decomposition and decontextualization have been explored independently, their interactions in a complete system have not been investigated. Their conflicting purposes can create tensions: decomposition isolates atomic facts while decontextualization inserts relevant information. Furthermore, a decontextualized subclaim presents a challenge to the verification step: what part of the augmented text should be verified as it now contains multiple atomic facts? We conduct an evaluation of different decomposition, decontextualization, and verification strategies and find that the choice of strategy matters in the resulting factuality scores. Additionally, we introduce DnDScore, a decontextualization-aware verification method that validates subclaims in the context of contextual information.

Evaluating the Evaluators: Are readability metrics good measures of readability?
Isabel Cachola | Daniel Khashabi | Mark Dredze
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries.
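
For reference, FKGL, the most widely used metric named above, is a fixed formula over word, sentence, and syllable counts. The following is a minimal sketch that assumes the counts are already available. The function name and arguments are illustrative and do not come from the paper, and syllable counting, which is heuristic in practice, is left out.

```python
def flesch_kincaid_grade(num_words, num_sentences, num_syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard published formula).

    The function name and signature are illustrative assumptions.
    """
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Example: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
print(round(grade, 2))
```

Because the score depends only on sentence length and syllable density, it cannot see properties such as required background knowledge, which is consistent with the abstract's finding that such metrics correlate poorly with human judgments.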

LLMs are Better Than You Think: Label-Guided In-Context Learning for Named Entity Recognition
Fan Bai | Hamid Hassanzadeh | Ardavan Saeedi | Mark Dredze
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.

A Novel Multi-Document Retrieval Benchmark: Journalist Source-Selection in Newswriting
Alexander Spangher | Tenghao Huang | Yiqin Huang | Lucas Spangher | Sewon Min | Mark Dredze
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing

Multi-document retrieval approaches often overlook the ways different retrievals complement each other when addressing complex queries. In this work, we study journalist source selection in news article writing and examine the discourse roles that different sources serve when paired together, finding that discourse function (not simply informational content) is an important component of source usage. Then, we introduce a novel IR task to benchmark how well language models can reason about this narrative process. We extract a journalist’s initial query and the sources they used from news articles and aim to recover the sources that support this query. We demonstrate that large language models (LLMs) can be employed in multi-step query planning, identifying informational gaps and enhancing retrieval performance, but current approaches to interleave queries fall short. By training auxiliary discourse planners and incorporating this information into LLMs, we enhance query planning, achieving a significant 5% improvement in precision and a 2% increase in F1 score over the previous SOTA, all while maintaining recall.

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
Hanjie Chen | Zhouxiang Fang | Yash Singla | Mark Dredze
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exams or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.

RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
Bang An | Shiyue Zhang | Mark Dredze
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Efforts to ensure the safety of large language models (LLMs) include safety fine-tuning, evaluation, and red teaming. However, despite the widespread use of the Retrieval-Augmented Generation (RAG) framework, AI safety work focuses on standard LLMs, which means we know little about how RAG use cases change a model’s safety profile. We conduct a detailed comparative analysis of RAG and non-RAG frameworks with eleven LLMs. We find that RAG can make models less safe and change their safety profile. We explore the causes of this change and find that even combinations of safe models with safe documents can cause unsafe generations. In addition, we evaluate some existing red teaming methods for RAG settings and show that they are less effective than when used for non-RAG settings. Our work highlights the need for safety research and red-teaming methods specifically tailored for RAG LLMs.

Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Kaiser Sun | Mark Dredze
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)

Large language model development relies on the pre-train-then-align paradigm, in which the model is typically pre-trained on a large text corpus and undergoes a tuning stage to align the model with human preference or downstream tasks. We investigate the relationship between pre-training and supervised fine-tuning by considering multiple tasks as well as different pre-trained model checkpoints. Our results on 18 datasets and two models suggest that i) although the model benefits significantly through supervised fine-tuning, it may forget previously known domain knowledge and tasks that are not seen during fine-tuning; ii) the model exhibits high sensitivity to evaluation prompts after supervised fine-tuning, but this sensitivity can be alleviated through further pre-training; iii) continual pre-training improves the model in a latent way that manifests after fine-tuning; iv) the model can already solve some tasks after pre-training, while fine-tuning most benefits datasets where the model does not show capability during pre-training.

2024

Can We Statically Locate Knowledge in Large Language Models? Financial Domain and Toxicity Reduction Case Studies
Jordi Armengol-Estapé | Lingyu Li | Sebastian Gehrmann | Achintya Gopal | David S Rosenberg | Gideon S. Mann | Mark Dredze
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Current large language model (LLM) evaluations rely on benchmarks to assess model capabilities and their encoded knowledge. However, these evaluations cannot reveal where a model encodes its knowledge, and thus little is known about which weights contain specific information. We propose a method to statically (without forward or backward passes) locate topical knowledge in the weight space of an LLM, building on a prior insight that parameters can be decoded into interpretable tokens. If parameters can be mapped into the embedding space, it should be possible to directly search for knowledge via embedding similarity. We study the validity of this assumption across several LLMs for a variety of concepts in the financial domain and a toxicity detection setup. Our analysis yields an improved understanding of the promises and limitations of static knowledge location in real-world scenarios.

Commercially available models dominate academic leaderboards. While impressive, this has concentrated research on creating and adapting general-purpose models to improve NLP leaderboard standings for large language models. However, leaderboards collect many individual tasks and general-purpose models often underperform in specialized domains; domain-specific or adapted models yield superior results. This focus on large general-purpose models excludes many academics and draws attention away from areas where they can make important contributions. We advocate for a renewed focus on developing and evaluating domain- and task-specific models, and highlight the unique role of academics in this endeavor.

Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs
Alexander Spangher | Nanyun Peng | Sebastian Gehrmann | Mark Dredze
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Journalists engage in multiple steps in the news writing process that depend on human creativity, like exploring different “angles” (i.e. the specific perspectives a reporter takes). These can potentially be aided by large language models (LLMs). By affecting planning decisions, such interventions can have an outsize impact on creative output. We advocate a careful approach to evaluating these interventions to ensure alignment with human values. In a case study of journalistic coverage of press releases, we assemble a large dataset of 250k press releases and 650k articles covering them. We develop methods to identify news articles that _challenge and contextualize_ press releases. Finally, we evaluate suggestions made by LLMs for these articles and compare these with decisions made by human journalists. Our findings are three-fold: (1) Human-written news articles that challenge and contextualize press releases more take more creative angles and use more informational sources. (2) LLMs align better with humans when recommending angles, compared with informational sources. (3) Both the angles and sources LLMs suggest are significantly less creative than humans.

Multi-Task Transfer Matters During Instruction-Tuning
David Mueller | Mark Dredze | Nicholas Andrews
Findings of the Association for Computational Linguistics: ACL 2024

Instruction-tuning trains a language model on hundreds of tasks jointly to improve a model’s ability to learn in-context; however, the mechanisms that drive in-context learning are poorly understood and, as a result, the role of instruction-tuning on in-context generalization is poorly understood as well. In this work, we study the impact of instruction-tuning on multi-task transfer: how well a model’s parameters adapt to an unseen task via fine-tuning. We find that instruction-tuning negatively impacts a model’s transfer to unseen tasks, and that model transfer and in-context generalization are highly correlated, suggesting that this catastrophic forgetting may impact in-context learning. We study methods to improve model transfer, finding that multi-task training (how well the training tasks are optimized) can significantly impact ICL generalization; additionally, we find that continual training on unsupervised pre-training data can mitigate forgetting and improve ICL generalization as well. Finally, we demonstrate that, early into training, the impact of instruction-tuning on model transfer to tasks impacts in-context generalization on that task. Overall, we provide significant evidence that multi-task transfer is deeply connected to a model’s ability to learn a task in-context.

Gender Bias in Decision-Making with Large Language Models: A Study of Relationship Conflicts
Sharon Levy | William Adler | Tahilin Sanchez Karver | Mark Dredze | Michelle R Kaufman
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) acquire beliefs about gender from training data and can therefore generate text with stereotypical gender attitudes. Prior studies have demonstrated model generations favor one gender or exhibit stereotypes about gender, but have not investigated the complex dynamics that can influence model reasoning and decision-making involving gender. We study gender equity within LLMs through a decision-making lens with a new dataset, DeMET Prompts, containing scenarios related to intimate, romantic relationships. We explore nine relationship configurations through name pairs across three name lists (men, women, neutral). We investigate equity in the context of gender roles through numerous lenses: typical and gender-neutral names, with and without model safety enhancements, same and mixed-gender relationships, and egalitarian versus traditional scenarios across various topics. While all models exhibit the same biases (women favored, then those with gender-neutral names, and lastly men), safety guardrails reduce bias. In addition, models tend to circumvent traditional male dominance stereotypes and side with “traditionally female” individuals more often, suggesting relationships are viewed as a female domain by the models.

Evaluating Biases in Context-Dependent Sexual and Reproductive Health Questions
Sharon Levy | Tahilin Sanchez Karver | William Adler | Michelle R Kaufman | Mark Dredze
Findings of the Association for Computational Linguistics: EMNLP 2024

Chat-based large language models have the opportunity to empower individuals lacking high-quality healthcare access to receive personalized information across a variety of topics. However, users may ask underspecified questions that require additional context for a model to correctly answer. We study how large language model biases are exhibited through these contextual questions in the healthcare domain. To accomplish this, we curate a dataset of sexual and reproductive healthcare questions (ContextSRH) that are dependent on age, sex, and location attributes. We compare models’ outputs with and without demographic context to determine answer alignment among our contextual questions. Our experiments reveal biases in each of these attributes, where young adult female users are favored.

In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess various LLMs’ capabilities on this task, we present a benchmark comprised of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. We use this collection of annotated tables to evaluate the ability of open-source and API-based language models to extract information from tables covering diverse domains and data formats. Our experiments demonstrate that surprisingly competitive performance can be achieved without requiring task-specific pipelines or labels, achieving F1 scores ranging from 74.2 to 96.1, while maintaining cost efficiency. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to model success and validate the practicality of distilling compact models to reduce API reliance.

Transferring Fairness using Multi-Task Learning with Limited Demographic Information
Carlos Alejandro Aguirre | Mark Dredze
Proceedings of the Third Workshop on NLP for Positive Impact

Training supervised machine learning systems with a fairness loss can improve prediction fairness across different demographic groups. However, doing so requires demographic annotations for training data, without which we cannot produce debiased classifiers for most tasks. Drawing inspiration from transfer learning methods, we investigate whether we can utilize demographic data from a related task to improve the fairness of a target task. We adapt a single-task fairness loss to a multi-task setting to exploit demographic labels from a related task in debiasing a target task, and demonstrate that demographic fairness objectives transfer fairness within a multi-task framework. Additionally, we show that this approach enables intersectional fairness by transferring between two datasets with different single-axis demographics. We explore different data domains to show how our loss can improve fairness across domains and tasks.

Selecting Shots for Demographic Fairness in Few-Shot Learning with Large Language Models
Carlos Aguirre | Kuleen Sasse | Isabel Cachola | Mark Dredze
Proceedings of the Third Workshop on NLP for Positive Impact

Recently, work in NLP has shifted to few-shot (in-context) learning, with large language models (LLMs) performing well across a range of tasks. However, while fairness evaluations have become a standard for supervised methods, little is known about the fairness of LLMs as prediction systems. Further, common standard methods for fairness involve access to model weights or are applied during finetuning, which are not applicable in few-shot learning. Do LLMs exhibit prediction biases when used for standard NLP tasks? In this work, we analyze the effect of shots, which directly affect the performance of models, on the fairness of LLMs as NLP classification systems. We consider how different shot selection strategies, both existing and new demographically sensitive methods, affect model fairness across three standard fairness datasets. We find that overall the performance of LLMs is not indicative of their fairness, and there is not a single method that fits all scenarios. In light of these facts, we discuss how future work can include LLM fairness in evaluations.

A Closer Look at Claim Decomposition
Miriam Wanner | Seth Ebner | Zhengping Jiang | Mark Dredze | Benjamin Van Durme
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

As generated text becomes more commonplace, it is increasingly important to evaluate how well-supported such text is by external knowledge sources. Many approaches for evaluating textual support rely on some method for decomposing text into its individual subclaims which are scored against a trusted reference. We investigate how various methods of claim decomposition—especially LLM-based methods—affect the result of an evaluation approach such as the recently proposed FActScore, finding that it is sensitive to the decomposition method used. This sensitivity arises because such metrics attribute overall textual support to the model that generated the text even though error can also come from the metric’s decomposition step. To measure decomposition quality, we introduce an adaptation of FActScore, which we call DecompScore. We then propose an LLM-based approach to generating decompositions inspired by Bertrand Russell’s theory of logical atomism and neo-Davidsonian semantics and demonstrate its improved decomposition quality over previous methods.

2023

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies
Shiyue Zhang | Shijie Wu | Ozan Irsoy | Steven Lu | Mohit Bansal | Mark Dredze | David Rosenberg
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P – that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may “over-generalize”, in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies.
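
The objective described above can be illustrated on small discrete distributions. The following is a minimal sketch, assuming P is known (as in the synthetic setting the abstract mentions) and that both distributions have full support. The function names and the mixing weight `eta` are illustrative assumptions, not the authors' implementation, which must approximate the reverse term when P is unknown.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x) for two discrete distributions."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def mixed_cross_entropy(p, q, eta=0.5):
    """Weighted mix of forward and reverse cross-entropy.

    p: data distribution P, q: model distribution Q.
    eta is a hypothetical mixing weight; eta = 1 recovers the forward
    cross-entropy (the MLE objective).
    """
    forward = cross_entropy(p, q)  # cross-entropy of Q relative to P
    reverse = cross_entropy(q, p)  # cross-entropy of P relative to Q
    return eta * forward + (1 - eta) * reverse


data_dist = [0.7, 0.2, 0.1]
model_dist = [0.5, 0.3, 0.2]
print(mixed_cross_entropy(data_dist, model_dist, eta=0.5))
```

The reverse term grows when Q places probability on outcomes that P considers unlikely, which is the over-generalization behavior the abstract describes.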

Widespread disparities in clinical outcomes exist between different demographic groups in the United States. A new line of work in medical sociology has demonstrated physicians often use stigmatizing language in electronic medical records within certain groups, such as black patients, which may exacerbate disparities. In this study, we characterize these instances at scale using a series of domain-informed NLP techniques. We highlight important differences between this task and analogous bias-related tasks studied within the NLP community (e.g., classifying microaggressions). Our study establishes a foundation for NLP researchers to contribute timely insights to a problem domain brought to the forefront by recent legislation regarding clinical documentation transparency. We release data, code, and models.

Joint End-to-end Semantic Proto-role Labeling
Elizabeth Spaulding | Gary Kazantsev | Mark Dredze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Semantic proto-role labeling (SPRL) assigns properties to arguments based on a series of binary labels. While multiple studies have evaluated various approaches to SPRL, it has only been studied in-depth as a standalone task using gold predicate/argument pairs. How do SPRL systems perform as part of an information extraction pipeline? We model SPRL jointly with predicate-argument extraction using a deep transformer model. We find that proto-role labeling is surprisingly robust in this setting, with only a small decrease when using predicted arguments. We include a detailed analysis of each component of the joint system, and an error analysis to understand correlations in errors between system stages. Finally, we study the effects of annotation errors on SPRL.

A Multi-instance Learning Approach to Civil Unrest Event Detection on Twitter
Alexandra DeLucia | Mark Dredze | Anna L. Buczak
Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text

Social media has become an established platform for people to organize and take offline actions, often in the form of civil unrest. Understanding these events can help support pro-democratic movements. The primary method to detect these events on Twitter relies on aggregating many tweets, but this includes many that are not relevant to the task. We propose a multi-instance learning (MIL) approach, which jointly identifies relevant tweets and detects civil unrest events. We demonstrate that MIL improves civil unrest detection over methods based on simple aggregation. Our best model achieves a 0.73 F1 on the Global Civil Unrest on Twitter (G-CUT) dataset.

Geo-Seq2seq: Twitter User Geolocation on Noisy Data through Sequence to Sequence Learning
Jingyu Zhang | Alexandra DeLucia | Chenyu Zhang | Mark Dredze
Findings of the Association for Computational Linguistics: ACL 2023

Location information can support social media analyses by providing geographic context. Some of the most accurate and popular Twitter geolocation systems rely on rule-based methods that examine the user-provided profile location, which fail to handle informal or noisy location names. We propose Geo-Seq2seq, a sequence-to-sequence (seq2seq) model for Twitter user geolocation that rewrites noisy, multilingual user-provided location strings into structured English location names. We train our system on tens of millions of multilingual location string and geotagged-tweet pairs. Compared to leading methods, our model vastly increases coverage (i.e., the number of users we can geolocate) while achieving comparable or superior accuracy. Our error analysis reveals that constrained decoding helps the model produce valid locations according to a location database. Finally, we measure biases across language, country of origin, and time to evaluate fairness, and find that while our model can generalize well to unseen temporal data, performance does vary by language and country.

On the Surprising Effectiveness of Name Matching Alone in Autoregressive Entity Linking
Elliot Schumacher | James Mayfield | Mark Dredze
Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023)

Fifteen years of work on entity linking has established the importance of different information sources in making linking decisions: mention and entity name similarity, contextual relevance, and features of the knowledge base. Modern state-of-the-art systems build on these features, including through neural representations (Wu et al., 2020). In contrast to this trend, the autoregressive language model GENRE (De Cao et al., 2021) generates normalized entity names for mentions and beats many other entity linking systems, despite making no use of knowledge base (KB) information. How is this possible? We analyze the behavior of GENRE on several entity linking datasets and demonstrate that its performance stems from memorization of name patterns. In contrast, it fails in cases that might benefit from using the KB. We experiment with a modification to the model to enable it to utilize KB information, highlighting challenges to incorporating traditional entity linking information sources into autoregressive models.

Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement
Gwenyth Portillo Wightman | Alexandra DeLucia | Mark Dredze
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

Large language models have achieved impressive few-shot performance on a wide variety of tasks. However, in many settings, users require confidence estimates for model predictions. While traditional classifiers produce scores for each label, language models instead produce scores for the generation which may not be well calibrated. We compare generations across diverse prompts and show that these can be used to create confidence scores. By utilizing more prompts we can get more precise confidence estimates and use response diversity as a proxy for confidence. We evaluate this approach across ten multiple-choice question-answering datasets using three models: T0, FLAN-T5, and GPT-3. In addition to analyzing multiple human written prompts, we automatically generate more prompts using a language model in order to produce finer-grained confidence estimates. Our method produces more calibrated confidence estimates compared to the log probability of the answer to a single prompt. These improvements could benefit users who rely on prediction confidence for integration into a larger system or in decision-making processes.
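
The idea of using agreement across prompts as a confidence signal can be sketched as follows. This is a simplified illustration under the assumption that each prompt yields one multiple-choice answer and that confidence is the fraction of prompts agreeing with the majority answer. The function name is hypothetical, and the paper's actual scoring may differ.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return the majority answer and the share of prompts that produced it.

    answers: one predicted label per prompt (e.g. "A", "B", ...).
    This is an illustrative proxy, not the authors' exact estimator.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(answers)
    majority_answer, votes = counts.most_common(1)[0]
    return majority_answer, votes / len(answers)


# Four prompt variants for the same question, three of which agree.
answer, confidence = agreement_confidence(["B", "B", "A", "B"])
print(answer, confidence)
```

With more prompts, the fraction can take more distinct values, which matches the abstract's point that additional prompts give finer-grained confidence estimates.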

2022

Updated Headline Generation: Creating Updated Summaries for Evolving News Stories
Sheena Panthaplackel | Adrian Benton | Mark Dredze
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose the task of updated headline generation, in which a system generates a headline for an updated article, considering both the previous article and headline. The system must identify the novel information in the article update, and modify the existing headline accordingly. We create data for this task using the NewsEdits corpus by automatically identifying contiguous article versions that are likely to require a substantive headline update. We find that models conditioned on the prior headline and body revisions produce headlines judged by humans to be as factual as gold headlines while making fewer unnecessary edits compared to a standard headline generation model. Our experiments establish benchmarks for this new contextual summarization task.

Model Distillation for Faithful Explanations of Medical Code Predictions
Zach Wood-Doughty | Isabel Cachola | Mark Dredze
Proceedings of the 21st Workshop on Biomedical Language Processing

Machine learning models that offer excellent predictive performance often lack the interpretability necessary to support integrated human machine decision-making. In clinical medicine and other high-risk settings, domain experts may be unwilling to trust model predictions without explanations. Work in explainable AI must balance competing objectives along two different axes: 1) Models should ideally be both accurate and simple. 2) Explanations must balance faithfulness to the model’s decision-making with their plausibility to a domain expert. We propose to use knowledge distillation, or training a student model that mimics the behavior of a trained teacher model, as a technique to generate faithful and plausible explanations. We evaluate our approach on the task of assigning ICD codes to clinical notes to demonstrate that the student model is faithful to the teacher model’s behavior and produces quality natural language explanations.

pdf bib abs

Explaining Models of Mental Health via Clinically Grounded Auxiliary Tasks
Ayah Zirikly | Mark Dredze
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

Models of mental health based on natural language processing can uncover latent signals of mental health from language. Models that indicate whether an individual is depressed, or has other mental health conditions, can aid in diagnosis and treatment. A critical aspect of integration of these models into the clinical setting relies on explaining their behavior to domain experts. In the case of mental health diagnosis, clinicians already rely on an assessment framework to make these decisions; that framework can help a model generate meaningful explanations. In this work we propose to use PHQ-9 categories as an auxiliary task for explaining a social media based model of depression. We develop a multi-task learning framework that predicts both depression and PHQ-9 categories as auxiliary tasks. We compare the quality of explanations generated based on the depression task only, versus those that use the predicted PHQ-9 categories. We find that by relying on clinically meaningful auxiliary tasks, we produce more meaningful explanations.

pdf bib abs

Then and Now: Quantifying the Longitudinal Validity of Self-Disclosed Depression Diagnoses
Keith Harrigian | Mark Dredze
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

Self-disclosed mental health diagnoses, which serve as ground truth annotations of mental health status in the absence of clinical measures, underpin the conclusions behind most computational studies of mental health language from the last decade. However, psychiatric conditions are dynamic; a prior depression diagnosis may no longer be indicative of an individual’s mental health, either due to treatment or other mitigating factors. We ask: to what extent are self-disclosures of mental health diagnoses actually relevant over time? We analyze recent activity from individuals who disclosed a depression diagnosis on social media over five years ago and, in turn, acquire a new understanding of how presentations of mental health status on social media manifest longitudinally. We also provide expanded evidence for the presence of personality-related biases in datasets curated using self-disclosed diagnoses. Our findings motivate three practical recommendations for improving mental health datasets curated using self-disclosed diagnoses: 1) Annotate diagnosis dates and psychiatric comorbidities; 2) Sample control groups using propensity score matching; 3) Identify and remove spurious correlations introduced by selection bias.

pdf bib abs

The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.

pdf bib abs

Do Text-to-Text Multi-Task Learners Suffer from Task Conflict?
David Mueller | Nicholas Andrews | Mark Dredze
Findings of the Association for Computational Linguistics: EMNLP 2022

Traditional multi-task learning architectures learn a single model across multiple tasks through a shared encoder followed by task-specific decoders. Learning these models often requires specialized training algorithms that address task-conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks. Does this new architecture suffer from task-conflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affect multi-task conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures.

pdf bib abs

What Makes Data-to-Text Generation Hard for Pretrained Language Models?
Moniba Keymanesh | Adrian Benton | Mark Dredze
Proceedings of the Second Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Expressing natural language descriptions of structured facts or relations – data-to-text generation (D2T) – increases the accessibility of structured knowledge repositories. Previous work shows that pre-trained language models (PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data. On the other hand, while auto-regressive PLMs can generalize from a few task examples, their efficacy at D2T is largely unexplored. Furthermore, we have an incomplete understanding of the limits of PLMs on D2T. In this work, we conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset. We consider their performance as a function of the amount of task-specific data and how the data is incorporated into the models: zero and few-shot learning, and fine-tuning of model weights. In addition, we probe the limits of PLMs by measuring performance on subsets of the evaluation data: novel predicates and abstractive test examples. To improve the performance on these subsets, we investigate two techniques: providing predicate descriptions in the context and re-ranking generated candidates by information reflected in the source. Finally, we conduct a human evaluation of model errors and show that D2T generation tasks would benefit from datasets with more careful manual curation.

pdf bib abs

Zero-shot Cross-Language Transfer of Monolingual Entity Linking Models
Elliot Schumacher | James Mayfield | Mark Dredze
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)

Most entity linking systems, whether mono or multilingual, link mentions to a single English knowledge base. Few have considered linking non-English text to a non-English KB, and therefore, transferring an English entity linking model to both a new document and KB language. We consider the task of zero-shot cross-language transfer of entity linking systems to a new language and KB. We find that a system trained with multilingual representations does reasonably well, and propose improvements to system training that lead to improved recall in most datasets, often matching the in-language performance. We further conduct a detailed evaluation to elucidate the challenges of this setting.

pdf bib abs

Zero-shot Cross-lingual Transfer is Under-specified Optimization
Shijie Wu | Benjamin Van Durme | Mark Dredze
Proceedings of the 7th Workshop on Representation Learning for NLP

Pretrained multilingual encoders enable zero-shot cross-lingual transfer, but often produce unreliable models that exhibit high performance variance on the target language. We postulate that this high variance results from zero-shot cross-lingual transfer solving an under-specified optimization problem. We show that any linear-interpolated model between the source language monolingual model and source + target bilingual model has equally low source language generalization error, yet the target language generalization error reduces smoothly and linearly as we move from the monolingual to bilingual model, suggesting that the model struggles to identify good solutions for both source and target languages using the source language alone. Additionally, we show that the zero-shot solution lies in a non-flat region of the target language generalization error surface, causing the high variance.

pdf bib abs

Changes in Tweet Geolocation over Time: A Study with Carmen 2.0
Jingyu Zhang | Alexandra DeLucia | Mark Dredze
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Researchers across disciplines use Twitter geolocation tools to filter data for desired locations. These tools have largely been trained and tested on English tweets, often originating in the United States from almost a decade ago. Despite the importance of these tools for data curation, the impact of tweet language, country of origin, and creation date on tool performance remains largely unknown. We explore these issues with Carmen, a popular tool for Twitter geolocation. To support this study we introduce Carmen 2.0, a major update which includes the incorporation of GeoNames, a gazetteer that provides much broader coverage of locations. We evaluate using two new Twitter datasets, one for multilingual, multiyear geolocation evaluation, and another for usage trends over time. We found that language, country of origin, and time do impact geolocation tool performance.

2021

pdf bib abs

User Factor Adaptation for User Embedding via Multitask Learning
Xiaolei Huang | Michael J. Paul | Franck Dernoncourt | Robin Burke | Mark Dredze
Proceedings of the Second Workshop on Domain Adaptation for NLP

Language varies across users and their interested fields in social media data: words authored by a user across his/her interests may have different meanings (e.g., cool) or sentiments (e.g., fast). However, most of the existing methods to train user embeddings ignore the variations across user interests, such as product and movie categories (e.g., drama vs. action). In this study, we treat the user interest as domains and empirically examine how the user language can vary across the user factor in three English social media datasets. We then propose a user embedding model to account for the language variability of user interests via a multitask learning framework. The model learns user language and its variations without human supervision. While existing work mainly evaluated the user embedding by extrinsic tasks, we propose an intrinsic evaluation via clustering and evaluate user embeddings by an extrinsic task, text classification. The experiments on the three English-language social media datasets show that our proposed approach can generally outperform baselines via adapting the user factor.

pdf bib abs

On the State of Social Media Data for Mental Health Research
Keith Harrigian | Carlos Aguirre | Mark Dredze
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Data-driven methods for mental health treatment and surveillance have become a major focus in computational science research in the last decade. However, progress in the domain remains bounded by the availability of adequate data. Prior systematic reviews have not necessarily made it possible to measure the degree to which data-related challenges have affected research progress. In this paper, we offer an analysis specifically on the state of social media data that exists for conducting mental health research. We do so by introducing an open-source directory of mental health datasets, annotated using a standardized schema to facilitate meta-analysis.

pdf bib abs

Qualitative Analysis of Depression Models by Demographics
Carlos Aguirre | Mark Dredze
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Models for identifying depression using social media text exhibit biases towards different gender and racial/ethnic groups. Factors like representation and balance of groups within the dataset are contributory factors, but difference in content and social media use may further explain these biases. We present an analysis of the content of social media posts from different demographic groups. Our analysis shows that there are content differences between depression and control subgroups across demographic groups, and that temporal topics and demographic-specific topics are correlated with downstream depression model error. We discuss the implications of our work on creating future datasets, as well as designing and training models for mental health.

pdf bib abs

Towards Understanding the Role of Gender in Deploying Social Media-Based Mental Health Surveillance Models
Eli Sherman | Keith Harrigian | Carlos Aguirre | Mark Dredze
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Spurred by advances in machine learning and natural language processing, developing social media-based mental health surveillance models has received substantial recent attention. For these models to be maximally useful, it is necessary to understand how they perform on various subgroups, especially those defined in terms of protected characteristics. In this paper we study the relationship between user demographics – focusing on gender – and depression. Considering a population of Reddit users with known genders and depression statuses, we analyze the degree to which depression predictions are subject to biases along gender lines using domain-informed classifiers. We then study our models’ parameters to gain qualitative insight into the differences in posting behavior across genders.

pdf bib abs

Gender and Racial Fairness in Depression Research using Social Media
Carlos Aguirre | Keith Harrigian | Mark Dredze
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Multiple studies have demonstrated that behaviors expressed on online social media platforms can indicate the mental health state of an individual. The widespread availability of such data has spurred interest in mental health research, using several datasets where individuals are labeled with mental health conditions. While previous research has raised concerns about possible biases in models produced from this data, no study has investigated how these biases manifest themselves with regard to demographic groups in data, such as gender and racial/ethnic groups. Here, we analyze the fairness of depression classifiers trained on Twitter data with respect to gender and racial demographic groups. We find that model performance differs for underrepresented groups, and we investigate sources of these biases beyond data representation. Our study results in recommendations on how to avoid these biases in future research.

pdf bib abs

Zero-shot cross-lingual information extraction (IE) describes the construction of an IE model for some target language, given existing annotations exclusively in some other language, typically English. While the advance of pretrained multilingual encoders suggests an easy optimism of “train on English, run on any language”, we find through a thorough exploration and extension of techniques that a combination of approaches, both new and old, leads to better performance than any one cross-lingual strategy in particular. We explore techniques including data projection and self-training, and how different pretrained encoders impact them. We use English-to-Arabic IE as our initial example, demonstrating strong performance in this setting for event extraction, named entity recognition, part-of-speech tagging, and dependency parsing. We then apply data projection and self-training to three tasks across eight target languages. Because no single set of techniques performs the best across all tasks, we encourage practitioners to explore various configurations of the techniques described in this work when seeking to improve on zero-shot training.

pdf bib

Cross-Lingual Transfer in Zero-Shot Cross-Language Entity Linking
Elliot Schumacher | James Mayfield | Mark Dredze
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib abs

Fine-tuning Encoders for Improved Monolingual and Zero-shot Polylingual Neural Topic Modeling
Aaron Mueller | Mark Dredze
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Neural topic models can augment or replace bag-of-words inputs with the learned representations of deep pre-trained transformer-based word prediction models. One added benefit when using representations from multilingual models is that they facilitate zero-shot polylingual topic modeling. However, while it has been widely observed that pre-trained embeddings should be fine-tuned to a given task, it is not immediately clear what supervision should look like for an unsupervised task such as topic modeling. Thus, we propose several methods for fine-tuning encoders to improve both monolingual and zero-shot polylingual neural topic modeling. We consider fine-tuning on auxiliary tasks, constructing a new topic classification task, integrating the topic classification objective directly into topic model training, and continued pre-training. We find that fine-tuning encoder representations on topic classification and integrating the topic classification task directly into topic modeling improves topic quality, and that fine-tuning encoder representations on any task is the most important factor for facilitating cross-lingual transfer.

pdf bib abs

Using Noisy Self-Reports to Predict Twitter User Demographics
Zach Wood-Doughty | Paiheng Xu | Xiao Liu | Mark Dredze
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Computational social science studies often contextualize content analysis within standard demographics. Since demographics are unavailable on many social media platforms (e.g. Twitter), numerous studies have inferred demographics automatically. Despite many studies presenting proof-of-concept inference of race and ethnicity, training of practical systems remains elusive since there are few annotated datasets. Existing datasets are small, inaccurate, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite the noise of automated supervision, our self-report datasets enable improvements in classification performance on gold standard self-report survey data. The result is a reproducible method for creating large-scale training resources for race and ethnicity.

pdf bib abs

Study of Manifestation of Civil Unrest on Twitter
Abhinav Chinta | Jingyu Zhang | Alexandra DeLucia | Mark Dredze | Anna L. Buczak
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Twitter is commonly used for civil unrest detection and forecasting tasks, but there is a lack of work in evaluating how civil unrest manifests on Twitter across countries and events. We present two in-depth case studies for two specific large-scale events, one in a country with high (English) Twitter usage (Johannesburg riots in South Africa) and one in a country with low Twitter usage (Burayu massacre protests in Ethiopia). We show that while there is event signal during the events, there is little signal leading up to the events. In addition to the case studies, we train Ngram-based models on a larger set of Twitter civil unrest data across time, events, and countries and use machine learning explainability tools (SHAP) to identify important features. The models were able to find words indicative of civil unrest that generalized across countries. The 42 countries span Africa, the Middle East, and Southeast Asia, and the events occurred between 2014 and 2019.

2020

pdf bib abs

Sources of Transfer in Multilingual Named Entity Recognition
David Mueller | Nicholas Andrews | Mark Dredze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Named-entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomenon, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.

pdf bib abs

Clinical Concept Linking with Contextualized Neural Representations
Elliot Schumacher | Andriy Mulyar | Mark Dredze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In traditional approaches to entity linking, linking decisions are based on three sources of information – the similarity of the mention string to an entity’s name, the similarity of the context of the document to the entity, and broader information about the knowledge base (KB). In some domains, there is little contextual information present in the KB and thus we rely more heavily on mention string similarity. We consider one example of this, concept linking, which seeks to link mentions of medical concepts to a medical concept ontology. We propose an approach to concept linking that leverages recent work in contextualized neural models, such as ELMo (Peters et al. 2018), which create a token representation that integrates the surrounding context of the mention and concept name. We find a neural ranking approach paired with contextualized embeddings provides gains over a competitive baseline (Leaman et al. 2013). Additionally, we find that a pre-training step using synonyms from the ontology offers a useful initialization for the ranker.

pdf bib abs

Do Explicit Alignments Robustly Improve Multilingual Encoders?
Shijie Wu | Mark Dredze
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Multilingual BERT (mBERT), XLM-RoBERTa (XLMR) and other unsupervised multilingual encoders can effectively learn cross-lingual representation. Explicit alignment objectives based on bitexts like Europarl or MultiUN have been shown to further improve these representations. However, word-level alignments are often suboptimal and such bitexts are unavailable for many languages. In this paper, we propose a new contrastive alignment objective that can better utilize such signal, and examine whether these previous alignment methods can be adapted to noisier sources of aligned data: a randomly sampled 1 million pair subset of the OPUS collection. Additionally, rather than report results on a single dataset with a single model run, we report the mean and standard deviation of multiple runs with different seeds, on four datasets and tasks. Our more extensive analysis finds that, while our new objective outperforms previous work, overall these methods do not improve performance with a more robust evaluation framework. Furthermore, the gains from using a better underlying model eclipse any benefits from alignment training. These negative results dictate more care in evaluating these methods and suggest limitations in applying explicit alignment objectives.

pdf bib abs

Do Models of Mental Health Based on Social Media Data Generalize?
Keith Harrigian | Carlos Aguirre | Mark Dredze
Findings of the Association for Computational Linguistics: EMNLP 2020

Proxy-based methods for annotating mental health status in social media have grown popular in computational research due to their ability to gather large training samples. However, an emerging body of literature has raised new concerns regarding the validity of these types of methods for use in clinical applications. To further understand the robustness of distantly supervised mental health models, we explore the generalization ability of machine learning classifiers trained to detect depression in individuals across multiple social media platforms. Our experiments not only reveal that substantial loss occurs when transferring between platforms, but also that there exist several unreliable confounding factors that may enable researchers to overestimate classification performance. Based on these results, we enumerate recommendations for future mental health dataset construction.

pdf bib abs

Are All Languages Created Equal in Multilingual BERT?
Shijie Wu | Mark Dredze
Proceedings of the 5th Workshop on Representation Learning for NLP

Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages covered by mBERT. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-speech Tagging and Dependency Parsing (54 languages each). mBERT does better than or comparable to baselines on high resource languages but does much worse for low resource languages. Furthermore, monolingual BERT models for these languages do even worse. Paired with similar languages, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low resource languages require more efficient pretraining techniques or more data.

pdf bib abs

Civil Unrest on Twitter (CUT): A Dataset of Tweets to Support Research on Civil Unrest
Justin Sech | Alexandra DeLucia | Anna L. Buczak | Mark Dredze
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

We present CUT, a dataset for studying Civil Unrest on Twitter. Our dataset includes 4,381 tweets related to civil unrest, hand-annotated with information related to the study of civil unrest discussion and events. Our dataset is drawn from 42 countries from 2014 to 2019. We present baseline systems trained on this data for the identification of tweets related to civil unrest. We include a discussion of ethical issues related to research on this topic.

2019

pdf bib abs

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
Shijie Wu | Mark Dredze
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Pretrained contextual representation models (Peters et al., 2018; Devlin et al., 2018) have pushed forward the state-of-the-art on many NLP tasks. A new release of BERT (Devlin, 2018) includes a model simultaneously pretrained on 104 languages with impressive performance for zero-shot cross-lingual transfer on a natural language inference task. This paper explores the broader cross-lingual potential of mBERT (multilingual) as a zero shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing. We compare mBERT with the best-published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language specific features, and measure factors that influence cross-lingual transfer.

pdf bib abs

Mental Health Surveillance over Social Media with Digital Cohorts
Silvio Amir | Mark Dredze | John W. Ayers
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

The ability to track mental health conditions via social media opened the doors for large-scale, automated, mental health surveillance. However, inferring accurate population-level trends requires representative samples of the underlying population, which can be challenging given the biases inherent in social media data. While previous work has adjusted samples based on demographic estimates, the populations were selected based on specific outcomes, e.g. specific mental health conditions. We depart from these methods, by conducting analyses over demographically representative digital cohorts of social media users. To validate this approach, we constructed a cohort of US based Twitter users to measure the prevalence of depression and PTSD, and investigate how these illnesses manifest across demographic subpopulations. The analysis demonstrates that cohort-based studies can help control for sampling biases, contextualize outcomes, and provide deeper insights into the data.

2018

pdf bib abs

Challenges of Using Text Classifiers for Causal Inference
Zach Wood-Doughty | Ilya Shpitser | Mark Dredze
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Causal understanding is essential for many kinds of decision-making, but causal inference from observational data has typically only been applied to structured, low-dimensional datasets. While text classifiers produce low-dimensional outputs, their use in causal inference has not previously been studied. To facilitate causal analyses based on language data, we consider the role that text classifiers can play in causal inference through established modeling mechanisms from the causality literature on missing data and measurement error. We demonstrate how to conduct causal analyses using text classifiers on simulated and Yelp data, and discuss the opportunities and challenges of future work that uses text data in causal inference.

pdf bib abs

Deep Dirichlet Multinomial Regression
Adrian Benton | Mark Dredze
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Dirichlet Multinomial Regression (DMR) and other supervised topic models can incorporate arbitrary document-level features to inform topic priors. However, their ability to model corpora is limited by the representation and selection of these features – a choice the topic modeler must make. Instead, we seek models that can learn the feature representations upon which to condition topic selection. We present deep Dirichlet Multinomial Regression (dDMR), a generative topic model that simultaneously learns document feature representations and topics. We evaluate dDMR on three datasets: New York Times articles with fine-grained tags, Amazon product reviews with product images, and Reddit posts with subreddit identity. dDMR learns representations that outperform DMR and LDA according to heldout perplexity and are more effective at downstream predictive tasks as the number of topics grows. Additionally, human subjects judge dDMR topics as being more representative of associated document features. Finally, we find that supervision leads to faster convergence as compared to an LDA baseline and that dDMR’s model fit is less sensitive to training parameters than DMR.

pdf bib abs

Johns Hopkins or johnny-hopkins: Classifying Individuals versus Organizations on Twitter
Zach Wood-Doughty | Praateek Mahajan | Mark Dredze
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Twitter user accounts include a range of different user types. While many individuals use Twitter, organizations also have Twitter accounts. Identifying opinions and trends from Twitter requires the accurate differentiation of these two groups. Previous work (McCorriston et al., 2015) presented a method for determining if an account was an individual or organization based on account profile and a collection of tweets. We present a method that relies solely on the account profile, allowing for the classification of individuals versus organizations based on a single tweet. Our method obtains accuracies comparable to methods that rely on much more information by leveraging two improvements: a character-based Convolutional Neural Network, and an automatically derived labeled corpus an order of magnitude larger than the previously available dataset. We make both the dataset and the resulting tool available.

pdf bib abs

Predicting Twitter User Demographics from Names Alone
Zach Wood-Doughty | Nicholas Andrews | Rebecca Marvin | Mark Dredze
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Social media analysis frequently requires tools that can automatically infer demographics to contextualize trends. These tools often require hundreds of user-authored messages for each user, which may be prohibitive to obtain when analyzing millions of users. We explore character-level neural models that learn a representation of a user’s name and screen name to predict gender and ethnicity, allowing for demographic inference with minimal data. We release trained models, which may enable new demographic analyses that would otherwise require enormous amounts of data collection.

pdf bib abs

Using Author Embeddings to Improve Tweet Stance Classification
Adrian Benton | Mark Dredze
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

Many social media classification tasks analyze the content of a message, but do not consider the context of the message. For example, in tweet stance classification – where a tweet is categorized according to a viewpoint it espouses – the expressed viewpoint depends on latent beliefs held by the user. In this paper we investigate whether incorporating knowledge about the author can improve tweet stance classification. Furthermore, since author information and embeddings are often unavailable for labeled training examples, we propose a semi-supervised pretraining method to predict user embeddings. Although the neural stance classifiers we learn are often outperformed by a baseline SVM, author embedding pre-training yields improvements over a non-pre-trained neural network on four out of five domains in the SemEval 2016 6A tweet stance classification task. In a tweet gun control stance classification dataset, improvements from pre-training are only apparent when training data is limited.

pdf bib abs

Convolutions Are All You Need (For Classifying Character Sequences)
Zach Wood-Doughty | Nicholas Andrews | Mark Dredze
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

While recurrent neural networks (RNNs) are widely used for text classification, they demonstrate poor performance and slow convergence when trained on long sequences. When text is modeled as characters instead of words, the longer sequences make RNNs a poor choice. Convolutional neural networks (CNNs), although somewhat less ubiquitous than RNNs, have an internal structure more appropriate for long-distance character dependencies. To better understand how CNNs and RNNs differ in handling long sequences, we use them for text classification tasks in several character-level social media datasets. The CNN models vastly outperform the RNN models in our experiments, suggesting that CNNs are superior to RNNs at learning to classify character-level data.

2017

Computer Assisted Discovery Extraction and Translation (CADET) is a workbench for helping knowledge workers find, label, and translate documents of interest. It combines a multitude of analytics together with a flexible environment for customizing the workflow for different users. This open-source framework allows for easy development of new research prototypes using a micro-service architecture based atop Docker and Apache Thrift.

pdf bib abs

Bayesian Modeling of Lexical Resources for Low-Resource Settings
Nicholas Andrews | Mark Dredze | Benjamin Van Durme | Jason Eisner
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as observations under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition.

pdf bib abs

Pocket Knowledge Base Population
Travis Wolfe | Mark Dredze | Benjamin Van Durme
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Existing Knowledge Base Population methods extract relations from a closed relational schema with limited coverage, leading to sparse KBs. We propose Pocket Knowledge Base Population (PKBP), the task of dynamically constructing a KB of entities related to a query and finding the best characterization of relationships between entities. We describe novel Open Information Extraction methods which leverage the PKB to find informative trigger words. We evaluate using existing KBP shared-task data as well as new annotations collected for this work. Our methods produce high-quality KBs from just text with many more entities and relationships than existing KBP systems.

pdf bib abs

Ethical Research Protocols for Social Media Health Research
Adrian Benton | Glen Coppersmith | Mark Dredze
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Social media have transformed data-driven research in political science, the social sciences, health, and medicine. Since health research often touches on sensitive topics that relate to ethics of treatment and patient privacy, similar ethical considerations should be acknowledged when using social media data in health research. While much has been said regarding the ethical considerations of social media research, health research leads to an additional set of concerns. We provide practical suggestions in the form of guidelines for researchers working with social media data in health research. These guidelines can inform an IRB proposal for researchers new to social media health research.

pdf bib abs

Multi-task Domain Adaptation for Sequence Tagging
Nanyun Peng | Mark Dredze
Proceedings of the 2nd Workshop on Representation Learning for NLP

Many domain adaptation approaches rely on learning cross-domain shared representations to transfer the knowledge learned in one domain to other domains. Traditional domain adaptation only considers adapting for one task. In this paper, we explore multi-task representation learning under the domain adaptation scenario. We propose a neural network framework that supports domain adaptation for multiple tasks simultaneously, and learns shared representations that better generalize for domain adaptation. We apply the proposed framework to domain adaptation for sequence tagging problems considering two tasks: Chinese word segmentation and named entity recognition. Experiments show that multi-task domain adaptation works better than disjoint domain adaptation for each task, and achieves state-of-the-art results for both tasks in the social media domain.

pdf bib abs

How Does Twitter User Behavior Vary Across Demographic Groups?
Zach Wood-Doughty | Michael Smith | David Broniatowski | Mark Dredze
Proceedings of the Second Workshop on NLP and Computational Social Science

Demographically-tagged social media messages are a common source of data for computational social science. While these messages can indicate differences in beliefs and behaviors between demographic groups, we do not have a clear understanding of how different demographic groups use platforms such as Twitter. This paper presents a preliminary analysis of how groups’ differing behaviors may confound analyses of the groups themselves. We analyzed one million Twitter users by first inferring demographic attributes, and then measuring several indicators of Twitter behavior. We find differences in these indicators across demographic groups, suggesting that there may be underlying differences in how different demographic groups use Twitter.

pdf bib abs

Constructing an Alias List for Named Entities during an Event
Anietie Andy | Mark Dredze | Mugizi Rwebangira | Chris Callison-Burch
Proceedings of the 3rd Workshop on Noisy User-generated Text

In certain fields, real-time knowledge from events can help in making informed decisions. In order to extract pertinent real-time knowledge related to an event, it is important to identify the named entities and their corresponding aliases related to the event. The problem of identifying aliases of named entities that spike has remained unexplored. In this paper, we introduce an algorithm, EntitySpike, that identifies entities that spike in popularity in tweets from a given time period, and constructs an alias list for these spiked entities. EntitySpike uses a temporal heuristic to identify named entities with similar context that occur in the same time period (within minutes) during an event. Each entity is encoded as a vector using this temporal heuristic. We show how these entity-vectors can be used to create a named entity alias list. We evaluated our algorithm on a dataset of temporally ordered tweets from a single event, the 2013 Grammy Awards show. We carried out various experiments on tweets that were published in the same time period and show that our algorithm identifies most entity name aliases and outperforms a competitive baseline.

2016

pdf bib

Embedding Lexical Features via Low-Rank Tensors
Mo Yu | Mark Dredze | Raman Arora | Matthew R. Gormley
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib

Geolocation for Twitter: Timing Matters
Mark Dredze | Miles Osborne | Prabhanjan Kambadur
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib

Learning Multiview Embeddings of Twitter Users
Adrian Benton | Raman Arora | Mark Dredze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib

Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning
Nanyun Peng | Mark Dredze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib

Knowledge Base Population for Organization Mentions in Email
Ning Gao | Mark Dredze | Douglas Oard
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib abs

Name Variation in Community Question Answering Systems
Anietie Andy | Satoshi Sekine | Mugizi Rwebangira | Mark Dredze
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the past resolved question on the site that is most similar to the unanswered question. Semantically similar questions could be worded differently, thereby making it difficult to find questions that have shared needs. For example, “Who is the best player for the Reds?” and “Who is currently the biggest star at Manchester United?” have a shared need but are worded differently; also, “Reds” and “Manchester United” both refer to the Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant past resolved questions that share needs with a given question by disambiguating named entities and matching these questions based on the disambiguated entities, identified entities, and knowledge base information related to these entities. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Q_given, [Q_past, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity-rich categories.

pdf bib

Demographer: Extremely Simple Name Demographics
Rebecca Knowles | Josh Carroll | Mark Dredze
Proceedings of the First Workshop on NLP and Computational Social Science

pdf bib

A Study of Imitation Learning Methods for Semantic Role Labeling
Travis Wolfe | Mark Dredze | Benjamin Van Durme
Proceedings of the Workshop on Structured Prediction for NLP

pdf bib

Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation
Mark Dredze | Nicholas Andrews | Jay DeYoung
Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media

2015

pdf bib

Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings
Nanyun Peng | Mark Dredze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib

Improved Relation Extraction with Feature-Rich Compositional Embedding Models
Matthew R. Gormley | Mo Yu | Mark Dredze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib

Predicate Argument Alignment using a Global Coherence Model
Travis Wolfe | Mark Dredze | Benjamin Van Durme
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib

Entity Linking for Spoken Language
Adrian Benton | Mark Dredze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib

Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction
Mo Yu | Matthew R. Gormley | Mark Dredze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib

An Empirical Study of Chinese Name Matching and Applications
Nanyun Peng | Mo Yu | Mark Dredze
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib

FrameNet+: Fast Paraphrastic Tripling of FrameNet
Ellie Pavlick | Travis Wolfe | Pushpendre Rastogi | Chris Callison-Burch | Mark Dredze | Benjamin Van Durme
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib abs

Sprite: Generalizing Topic Models with Structured Priors
Michael J. Paul | Mark Dredze
Transactions of the Association for Computational Linguistics, Volume 3

We introduce Sprite, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing Sprite to be tailored to particular settings. We demonstrate this flexibility by constructing a Sprite-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.

pdf bib abs

Learning Composition Models for Phrase Embeddings
Mo Yu | Mark Dredze
Transactions of the Association for Computational Linguistics, Volume 3

Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.

pdf bib abs

Approximation-Aware Dependency Parsing by Belief Propagation
Matthew R. Gormley | Mark Dredze | Jason Eisner
Transactions of the Association for Computational Linguistics, Volume 3

We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n^3) runtime. It outputs the parse with maximum expected recall, but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by back-propagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.

pdf bib

From ADHD to SAD: Analyzing the Language of Mental Health on Twitter through Self-Reported Diagnoses
Glen Coppersmith | Mark Dredze | Craig Harman | Kristy Hollingshead
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

pdf bib

CLPsych 2015 Shared Task: Depression and PTSD on Twitter
Glen Coppersmith | Mark Dredze | Craig Harman | Kristy Hollingshead | Margaret Mitchell
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality