Frank Rudzicz - ACL Anthology

Frank Rudzicz

2026

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
Abeer Badawi | Elahe Rahimi | Md Tahmid Rahman Laskar | Sheri Grach | Lindsay Bertrand | Lames Danok | Prathiba Dhanesh | Jimmy Huang | Frank Rudzicz | Elham Dolatabadi
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Evaluating Large Language Models (LLMs) for mental health support poses unique challenges to reliable evaluation due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, authenticity, and reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale authentic dialogue datasets and judge-reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation in this domain. MentalBench-100k consolidates 10,000 authentic single-session therapeutic conversations from three real-world scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective–Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for the reliable and large-scale evaluation of LLMs in mental health contexts.
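As a rough illustration of how intraclass correlation can separate agreement from consistency when a judge is biased, the following is a minimal sketch, not the paper's implementation. It assumes a two-way layout with one score per item per rater and uses the standard single-rater ICC(2,1) and ICC(3,1) point estimates; the abstract does not say which ICC forms are used, confidence intervals are omitted, and the function name `icc_two_way` and the toy scores are hypothetical.

```python
import numpy as np

def icc_two_way(ratings):
    """Return (ICC(2,1) absolute agreement, ICC(3,1) consistency).

    ratings: array of shape (n_items, n_raters), one score per cell.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    # Absolute agreement penalises systematic rater offsets (ms_cols term).
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    # Consistency ignores constant offsets between raters.
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return agreement, consistency

# Toy data (hypothetical): the judge ranks items exactly like the human
# expert but inflates every score by one point.
human = np.array([1, 2, 3, 4, 5])
judge = human + 1
agreement, consistency = icc_two_way(np.column_stack([human, judge]))
mean_bias = float(np.mean(judge - human))
```

In this toy case the consistency estimate is 1 while the agreement estimate drops to 5/6, and the mean difference of one point exposes the inflation directly.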

Is This LLM Library Learning? Evaluation Must Account For Compute and Behaviour
Ian Berlot-Attwell | Tobias Sesterhenn | Frank Rudzicz | Xujie Si
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The in-context learning (ICL) coding, reasoning, and tool-using ability of LLMs has spurred interest in library learning (i.e., the creation and exploitation of reusable and composable functions, tools, or lemmas). Such systems often promise improved task performance and computational efficiency by caching reasoning (i.e., storing generated tools) - all without finetuning. However, we find strong reasons to be skeptical. Specifically, we identify a serious evaluation flaw present in a large number of ICL library learning works: these works do not correct for the difference in computational cost between baseline and library learning systems. Studying three separately published ICL library learning systems, we find that all of them fail to consistently outperform the simple baseline of prompting the model - improvements in task accuracy often vanish or reverse once computational cost is accounted for. Furthermore, we perform an in-depth examination of one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Our findings suggest that a serious re-examination of the effectiveness of ICL LLM-based library learning is required, as are much stronger standards for evaluation. An equal computational budget must be used for baselines, alongside behavioural analysis.

2025

CausalLink: An Interactive Evaluation Framework for Causal Reasoning
Jinyue Feng | Frank Rudzicz
Findings of the Association for Computational Linguistics: ACL 2025

We present CausalLink, an innovative evaluation framework that interactively assesses the causal reasoning skill to identify the correct intervention in conversational language models. Each CausalLink test case creates a hypothetical environment in which the language models are instructed to apply interventions to entities whose interactions follow predefined causal relations generated from controllable causal graphs. Our evaluation framework isolates causal capabilities from the confounding effects of world knowledge and semantic cues. We evaluate a series of LLMs in a scenario featuring movements of geometric shapes and discover that models start to exhibit reliable reasoning on two or three variables at the 14-billion-parameter scale. However, the performance of state-of-the-art models such as GPT4o degrades below random chance as the number of variables increases. We identify and analyze several key failure modes.

Trustworthy Medical Question Answering: An Evaluation-Centric Survey
Yinuo Wang | Baiyang Wang | Robert Mercer | Frank Rudzicz | Sudipta Singha Roy | Pengjie Ren | Zhumin Chen | Xindi Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges—such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies—and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.

ACCORD: Closing the Commonsense Measurability Gap
François Roewer-Després | Jinyue Feng | Zining Zhu | Frank Rudzicz
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, so it scales with future LLM improvements. Indeed, our experiments on state-of-the-art LLMs show performance degrading to below random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.

Not Lost After All: How Cross-Encoder Attribution Challenges Position Bias Assumptions in LLM Summarization
Elahe Rahimi | Hassan Sajjad | Domenic Rosati | Abeer Badawi | Elham Dolatabadi | Frank Rudzicz
Findings of the Association for Computational Linguistics: EMNLP 2025

Position bias, the tendency of Large Language Models (LLMs) to select content based on its structural position in a document rather than its semantic relevance, has been viewed as a key limitation in automatic summarization. To measure position bias, prior studies rely heavily on n-gram matching techniques, which fail to capture semantic relationships in abstractive summaries where content is extensively rephrased. To address this limitation, we apply a cross-encoder-based alignment method that jointly processes summary-source sentence pairs, enabling more accurate identification of semantic correspondences even when summaries substantially rewrite the source. Experiments with five LLMs across six summarization datasets reveal significantly different position bias patterns than those reported by traditional metrics. Our findings suggest that these patterns primarily reflect rational adaptations to document structure and content rather than true model limitations. Through controlled experiments and analyses across varying document lengths and multi-document settings, we show that LLMs use content from all positions more effectively than previously assumed, challenging common claims about “lost-in-the-middle” behaviour.

2024

Long-form evaluation of model editing
Domenic Rosati | Robie Gonzales | Jinkun Chen | Xuemin Yu | Yahya Kayani | Frank Rudzicz | Hassan Sajjad
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Evaluations of model editing, a technique for changing the factual knowledge held by Large Language Models (LLMs), currently only use the ‘next few token’ completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings including internal consistency, lexical cohesion, and locality issues.

Auxiliary Knowledge-Induced Learning for Automatic Multi-Label Medical Document Classification
Xindi Wang | Robert E. Mercer | Frank Rudzicz
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The International Classification of Diseases (ICD) is an authoritative medical classification system of different diseases and conditions for clinical and management purposes. ICD indexing aims to assign a subset of ICD codes to a medical record. Since human coding is labour-intensive and error-prone, many studies employ machine learning techniques to automate the coding process. ICD coding is a challenging task, as it needs to assign multiple codes to each medical document from an extremely large hierarchically organized collection. In this paper, we propose a novel approach for ICD indexing that adopts three ideas: (1) we use a multi-level deep dilated residual convolution encoder to aggregate the information from the clinical notes and learn document representations across different lengths of the texts; (2) we formalize the task of ICD classification with auxiliary knowledge of the medical records, which incorporates not only the clinical texts but also different clinical code terminologies and drug prescriptions for better inferring the ICD codes; and (3) we introduce a graph convolutional network to leverage the co-occurrence patterns among ICD codes, aiming to enhance the quality of label representations. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.

Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation
Xindi Wang | Robert Mercer | Frank Rudzicz
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches have suffered from selecting the proper label subsets from an extremely large ICD collection with a heavy long-tailed label distribution. In this paper, we leverage a multi-stage “retrieve and re-rank” framework as a novel solution to ICD indexing, via a hybrid discrete retrieval method, and re-rank retrieved candidates with contrastive learning that allows the model to make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge of the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling together the clinical notes with positive ICD codes. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark.

Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification
Sudipta Singha Roy | Xindi Wang | Robert Mercer | Frank Rudzicz
Findings of the Association for Computational Linguistics: EMNLP 2024

Long document classification presents challenges in capturing both local and global dependencies due to their extensive content and complex structure. Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents. To address these constraints, we propose a novel model leveraging a graph-tree structure. Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts, respectively. We use Tree Transformers to generate sentence encodings, while a graph attention network models inter- and intra-sentence dependencies. During training, we implement bidirectional information propagation from word-to-sentence-to-document and vice versa, which enriches the contextual representation. Our proposed method enables a comprehensive understanding of content at all hierarchical levels and effectively handles arbitrarily long contexts without token limit constraints. Experimental results demonstrate the effectiveness of our approach in all types of long document classification tasks.

Immunization against harmful fine-tuning attacks
Domenic Rosati | Jan Wehner | Kai Williams | Lukasz Bartoszcze | Hassan Sajjad | Frank Rudzicz
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) are often trained with safety guards intended to prevent harmful text generation. However, such safety training can be removed by fine-tuning the LLM on harmful datasets. While this emerging threat (harmful fine-tuning attacks) has been characterized by previous work, there is little understanding of how we should proceed in constructing and validating defenses against these attacks, especially in the case where defenders would not have control of the fine-tuning process. We introduce a formal framework based on the training budget of an attacker which we call “Immunization” conditions. Using a formal characterisation of the harmful fine-tuning problem, we provide a thorough description of what a successful defense must comprise and establish a set of guidelines on how rigorous defense research that gives us confidence should proceed.

A Retrieval Augmented Approach for Text-to-Music Generation
Robie Gonzales | Frank Rudzicz
Proceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA)

Generative text-to-music models such as MusicGen are capable of generating high fidelity music conditioned on a text prompt. However, expressing the essential features of music with text is a challenging task. In this paper, we present a retrieval-augmented approach for text-to-music generation. We first pre-compute a dataset of text-music embeddings obtained from a contrastive language-audio pretrained encoder (CLAP). Then, given an input text prompt, we retrieve the top k most similar musical aspects and augment the original prompt. This approach consistently generates music of higher audio quality as measured by the Fréchet Audio Distance. We analyze the internal representations of MusicGen and find that augmented prompts lead to greater diversity in token distributions and display high text adherence. Our findings show the potential for increased control in text-to-music generation.
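The retrieve-then-augment step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the embeddings are toy 2-dimensional vectors standing in for precomputed CLAP embeddings, cosine similarity is assumed as the similarity measure, and the helper names `top_k_similar` and `augment_prompt`, the descriptor strings, and the comma-joined prompt format are all hypothetical.

```python
import numpy as np

def top_k_similar(query, corpus, k):
    """Return indices of the k corpus rows most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q
    # Sort by descending similarity and keep the first k indices.
    return np.argsort(-sims)[:k].tolist()

def augment_prompt(prompt, descriptors):
    """Append retrieved descriptors to the prompt (format is an assumption)."""
    return prompt + ", " + ", ".join(descriptors)

# Toy stand-ins for a precomputed embedding store and its text descriptors.
corpus_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
corpus_descriptors = ["slow tempo", "distorted guitar", "solo piano"]

query_embedding = np.array([1.0, 0.0])  # would come from the text encoder
hits = top_k_similar(query_embedding, corpus_embeddings, k=2)
augmented = augment_prompt("calm piano", [corpus_descriptors[i] for i in hits])
```

The augmented string would then be passed to the text-to-music model in place of the original prompt.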

2023

Who needs context? Classical techniques for Alzheimer’s disease detection
Behrad Taghibeyglou | Frank Rudzicz
Proceedings of the 5th Clinical Natural Language Processing Workshop

Natural language processing (NLP) has shown great potential for Alzheimer’s disease (AD) detection, particularly due to the adverse effect of AD on spontaneous speech. The current body of literature has directed attention toward context-based models, especially Bidirectional Encoder Representations from Transformers (BERTs), owing to their exceptional abilities to integrate contextual information in a wide range of NLP tasks. This comes at the cost of added model opacity and computational requirements. Taking this into consideration, we propose a Word2Vec-based model for AD detection in 108 age- and sex-matched participants who were asked to describe the Cookie Theft picture. We also investigate the effectiveness of our model by fine-tuning BERT-based sequence classification models, as well as incorporating linguistic features. Our results demonstrate that our lightweight and easy-to-implement model outperforms some of the state-of-the-art models available in the literature, as well as BERT models.

Improving Automatic Quotation Attribution in Literary Novels
Krishnapriya Vishnubhotla | Frank Rudzicz | Graeme Hirst | Adam Hammond
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Current models for quotation attribution in literary novels assume varying levels of available information in their training and test data, which poses a challenge for in-the-wild inference. Here, we approach quotation attribution as a set of four interconnected sub-tasks: character identification, coreference resolution, quotation identification, and speaker attribution. We benchmark state-of-the-art models on each of these sub-tasks independently, using a large dataset of annotated coreferences and quotations in literary novels (the Project Dialogism Novel Corpus). We also train and evaluate models for the speaker attribution task in particular, showing that a simple sequential prediction model achieves accuracy scores on par with state-of-the-art models.

A State-Vector Framework for Dataset Effects
Esmat Sahak | Zining Zhu | Frank Rudzicz
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The impressive success of recent deep neural network (DNN)-based systems is significantly influenced by the high-quality datasets used in training. However, the effects of the datasets, especially how they interact with each other, remain underexplored. We propose a state-vector framework to enable rigorous studies in this direction. This framework uses idealized probing test results as the bases of a vector space. This framework allows us to quantify the effects of both standalone and interacting datasets. We show that the significant effects of some commonly-used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some “spill-over” effects: the datasets could impact the models along dimensions that may seem unrelated to the intended tasks. Our state-vector framework paves the way for a systematic understanding of the dataset effects, a crucial component in responsible and robust model development.

2022

KenMeSH: Knowledge-enhanced End-to-end Biomedical Text Labelling
Xindi Wang | Robert Mercer | Frank Rudzicz
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Currently, Medical Subject Headings (MeSH) are manually assigned to every biomedical article published and subsequently recorded in the PubMed database to facilitate retrieving relevant information. With the rapid growth of the PubMed database, large-scale biomedical document indexing becomes increasingly important. MeSH indexing is a challenging task for machine learning, as it needs to assign multiple labels to each article from an extremely large hierarchically organized collection. To address this challenge, we propose KenMeSH, an end-to-end model that combines new text features and a dynamic knowledge-enhanced mask attention that integrates document features with MeSH label hierarchy and journal correlation features to index MeSH terms. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.

On the data requirements of probing
Zining Zhu | Jixuan Wang | Bai Li | Frank Rudzicz
Findings of the Association for Computational Linguistics: ACL 2022

As large and powerful neural language models are developed, researchers have been increasingly interested in developing diagnostic tools to probe them. There are many papers with conclusions of the form “observation X is found in model Y”, using their own datasets with varying sizes. Larger probing datasets bring more reliability, but are also expensive to collect. There is yet to be a quantitative method for estimating reasonable probing dataset sizes. We tackle this omission in the context of comparing two probing configurations: after we have collected a small dataset from a pilot study, how many additional data samples are sufficient to distinguish two different configurations? We present a novel method to estimate the required number of data samples in such experiments and, across several case studies, we verify that our estimations have sufficient statistical power. Our framework helps to systematically construct probing datasets to diagnose neural NLP models.

Building Agent Assistants that can help improve customer service support requires inputs from industry users and their customers, as well as knowledge about state-of-the-art Natural Language Processing (NLP) technology. We combine expertise from academia and industry to bridge the gap and build task/domain-specific Neural Agent Assistants (NAA) with three high-level components for: (1) Intent Identification, (2) Context Retrieval, and (3) Response Generation. In this paper, we outline the pipeline of the NAA’s core system and also present three case studies in which three industry partners successfully adapt the framework to find solutions to their unique challenges. Our findings suggest that a collaborative process is instrumental in spurring the development of emerging NLP models for Conversational AI tasks in industry. The full reference implementation code and results are available at https://github.com/VectorInstitute/NAA.

MeSHup: Corpus for Full Text Biomedical Document Indexing
Xindi Wang | Robert E. Mercer | Frank Rudzicz
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Medical Subject Heading (MeSH) indexing refers to the problem of assigning a given biomedical document with the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time consuming and costly; therefore, a computational system that can assist the indexing is highly valuable. When developing supervised MeSH indexing systems, the availability of a large-scale annotated text corpus is desirable. A publicly available, large corpus that permits robust evaluation and comparison of various systems is important to the research community. We release a large scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full text articles, together with the associated MeSH labels and metadata, authors and publication venues that are collected from the MEDLINE database. We train an end-to-end model that combines features from documents and their associated labels on our corpus and report the new baseline.

Doctor XAvIer: Explainable Diagnosis on Physician-Patient Dialogues and XAI Evaluation
Hillary Ngai | Frank Rudzicz
Proceedings of the 21st Workshop on Biomedical Language Processing

We introduce Doctor XAvIer — a BERT-based diagnostic system that extracts relevant clinical data from transcribed patient-doctor dialogues and explains predictions using feature attribution methods. We present a novel performance plot and evaluation metric for feature attribution methods — Feature Attribution Dropping (FAD) curve and its Normalized Area Under the Curve (N-AUC). FAD curve analysis shows that integrated gradients outperforms Shapley values in explaining diagnosis classification. Doctor XAvIer outperforms the baseline with 0.97 F1-score in named entity recognition and symptom pertinence classification and 0.91 F1-score in diagnosis classification.

Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric
Ian Berlot-Attwell | Frank Rudzicz
Proceedings of the 4th Workshop on NLP for Conversational AI

In this work, we evaluate various existing dialogue relevance metrics, find strong dependency on the dataset, often with poor correlation with human scores of relevance, and propose modifications to reduce data requirements and domain sensitivity while improving correlation. Our proposed metric achieves state-of-the-art performance on the HUMOD dataset while reducing measured sensitivity to dataset by 37%-66%. We achieve this without fine-tuning a pretrained language model, and using only 3,750 unannotated human dialogues and a single negative example. Despite these limitations, we demonstrate competitive performance on four datasets from different domains. Our code, including our metric and experiments, is open sourced.

Neural reality of argument structure constructions
Bai Li | Zining Zhu | Guillaume Thomas | Frank Rudzicz | Yang Xu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In lexicalist linguistic theories, argument structure is assumed to be predictable from the meaning of verbs. As a result, the verb is the primary determinant of the meaning of a clause. In contrast, construction grammarians propose that argument structure is encoded in constructions (or form-meaning pairs) that are distinct from verbs. Two decades of psycholinguistic research have produced substantial empirical evidence in favor of the construction view. Here we adapt several psycholinguistic studies to probe for the existence of argument structure constructions (ASCs) in Transformer-based language models (LMs). First, using a sentence sorting experiment, we find that sentences sharing the same construction are closer in embedding space than sentences sharing the same verb. Furthermore, LMs increasingly prefer grouping by construction with more input data, mirroring the behavior of non-native language learners. Second, in a “Jabberwocky” priming-based experiment, we find that LMs associate ASCs with meaning, even in semantically nonsensical sentences. Our work offers the first evidence for ASCs in LMs and highlights the potential to devise novel probing methods grounded in psycholinguistic research.

Data-driven Approach to Differentiating between Depression and Dementia from Noisy Speech and Language Data
Malikeh Ehghaghi | Frank Rudzicz | Jekaterina Novikova
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

A significant number of studies apply acoustic and linguistic characteristics of human speech as prominent markers of dementia and depression. However, studies on discriminating depression from dementia are rare. Co-morbid depression is frequent in dementia and these clinical conditions share many overlapping symptoms, but the ability to distinguish between depression and dementia is essential as depression is often curable. In this work, we investigate the ability of clustering approaches to distinguish between depression and dementia from human speech. We introduce a novel aggregated dataset, which combines narrative speech data from multiple conditions, i.e., Alzheimer’s disease, mild cognitive impairment, healthy control, and depression. We compare linear and non-linear clustering approaches and show that non-linear clustering techniques distinguish better between distinct disease clusters. Our interpretability analysis shows that the main differentiating symptoms between dementia and depression are acoustic abnormality, repetitiveness (or circularity) of speech, word finding difficulty, coherence impairment, and differences in lexical complexity and richness.

Detoxifying Language Models with a Toxic Corpus
Yoona Park | Frank Rudzicz
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Existing studies have investigated the tendency of autoregressive language models to generate contexts that exhibit undesired biases and toxicity. Various debiasing approaches have been proposed, which are primarily categorized into data-based and decoding-based. In our study, we investigate the ensemble of the two debiasing paradigms, proposing to use a toxic corpus as an additional resource to reduce the toxicity. Our results show that a toxic corpus can indeed help to reduce the toxicity of the language generation process substantially, complementing the existing debiasing methods.

Predicting Fine-Tuning Performance with Probing
Zining Zhu | Soroosh Shahtalebi | Frank Rudzicz
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Large NLP models have recently shown impressive performance in language understanding tasks, typically evaluated by their fine-tuned performance. Alternatively, probing has received increasing attention as being a lightweight method for interpreting the intrinsic mechanisms of large NLP models. In probing, post-hoc classifiers are trained on “out-of-domain” datasets that diagnose specific abilities. While probing the language models has led to insightful findings, they appear disjointed from the development of models. This paper explores the utility of probing deep NLP models to extract a proxy signal widely used in model development – the fine-tuning performance. We find that it is possible to use the accuracies of only three probing tests to predict the fine-tuning performance with errors 40% - 80% smaller than baselines. We further discuss possible avenues where probing can empower the development of deep NLP models.

2021

TorontoCL at CMCL 2021 Shared Task: RoBERTa with Multi-Stage Fine-Tuning for Eye-Tracking Prediction
Bai Li | Frank Rudzicz
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Eye movement data during reading is a useful source of information for understanding language comprehension processes. In this paper, we describe our submission to the CMCL 2021 shared task on predicting human reading patterns. Our model uses RoBERTa with a regression layer to predict 5 eye-tracking features. We train the model in two stages: we first fine-tune on the Provo corpus (another eye-tracking dataset), then fine-tune on the task data. We compare different Transformer models and apply ensembling methods to improve the performance. Our final submission achieves a MAE score of 3.929, ranking 3rd place out of 13 teams that participated in this shared task.

How is BERT surprised? Layerwise detection of linguistic anomalies
Bai Li | Zining Zhu | Guillaume Thomas | Yang Xu | Frank Rudzicz
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Transformer language models have shown remarkable ability in detecting when a word is anomalous in context, but likelihood scores offer no information about the cause of the anomaly. In this work, we use Gaussian models for density estimation at intermediate layers of three language models (BERT, RoBERTa, and XLNet), and evaluate our method on BLiMP, a grammaticality judgement benchmark. In lower layers, surprisal is highly correlated to low token frequency, but this correlation diminishes in upper layers. Next, we gather datasets of morphosyntactic, semantic, and commonsense anomalies from psycholinguistic studies; we find that the best performing model RoBERTa exhibits surprisal in earlier layers when the anomaly is morphosyntactic than when it is semantic, while commonsense anomalies do not exhibit surprisal at any intermediate layer. These results suggest that language models employ separate mechanisms to detect different types of linguistic anomalies.
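
The density-estimation idea in this abstract can be illustrated with a minimal sketch. It assumes a diagonal-covariance Gaussian fitted to hidden-state vectors from one layer, which is a simplification chosen to keep the example dependency-free; the abstract does not specify the covariance structure, and the vectors below are made-up numbers, not real model activations. Surprisal is taken as the negative log-density of a vector under the fitted Gaussian.

```python
import math

def fit_diagonal_gaussian(vectors):
    """Estimate per-dimension mean and variance from equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # A small floor keeps every variance strictly positive.
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
        for d in range(dim)
    ]
    return mean, var

def surprisal(x, mean, var):
    """Negative log-density of x under the diagonal Gaussian, in nats."""
    return sum(
        0.5 * math.log(2 * math.pi * v) + (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical "hidden states" for one layer (illustrative values only).
train = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
mean, var = fit_diagonal_gaussian(train)

typical = surprisal([0.05, 0.95], mean, var)   # close to the training data
anomalous = surprisal([3.0, -2.0], mean, var)  # far from the training data
print(typical < anomalous)
```

Repeating the fit separately for each layer would give one surprisal score per layer, which is the kind of layerwise comparison the abstract describes.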

An unsupervised framework for tracing textual sources of moral change
Aida Ramezani | Zining Zhu | Frank Rudzicz | Yang Xu
Findings of the Association for Computational Linguistics: EMNLP 2021

Morality plays an important role in social well-being, but people’s moral perception is not stable and changes over time. Recent advances in natural language processing have shown that text is an effective medium for informing moral change, but no attempt has been made to quantify the origins of these changes. We present a novel unsupervised framework for tracing textual sources of moral change toward entities through time. We characterize moral change with probabilistic topical distributions and infer the source text that exerts prominent influence on the moral time course. We evaluate our framework on a diverse set of data ranging from social media to news articles. We show that our framework not only captures fine-grained human moral judgments, but also identifies coherent source topics of moral change triggered by historical events. We apply our methodology to analyze the news in the COVID-19 pandemic and demonstrate its utility in identifying sources of moral change in high-impact and real-time social events.

An Evaluation of Disentangled Representation Learning for Texts
Krishnapriya Vishnubhotla | Graeme Hirst | Frank Rudzicz
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

On Losses for Modern Language Models
Stéphane Aroca-Ouellette | Frank Rudzicz
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

BERT set many state-of-the-art results over varied NLU benchmarks by pre-training over two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP’s effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks into pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks – sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant – that outperform a pure MLM baseline. Finally, we demonstrate that using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERTBase on the GLUE benchmark using fewer than a quarter of the training tokens.

Examining the rhetorical capacities of neural language models
Zining Zhu | Chuer Pan | Mohamed Abdalla | Frank Rudzicz
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Recently, neural language models (LMs) have demonstrated impressive abilities in generating high-quality discourse. While many recent papers have analyzed the syntactic aspects encoded in LMs, there has been no analysis to date of the inter-sentential, rhetorical knowledge. In this paper, we propose a method that quantitatively evaluates the rhetorical capacities of neural LMs. We examine the capacities of neural LMs understanding the rhetoric of discourse by evaluating their abilities to encode a set of linguistic features derived from Rhetorical Structure Theory (RST). Our experiments show that BERT-based LMs outperform other Transformer LMs, revealing the richer discourse knowledge in their intermediate layer representations. In addition, GPT-2 and XLNet apparently encode less rhetorical knowledge, and we suggest an explanation drawing from linguistic philosophy. Our method shows an avenue towards quantifying the rhetorical capacities of neural LMs.

Representation Learning for Discovering Phonemic Tone Contours
Bai Li | Jing Yi Xie | Frank Rudzicz
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Tone is a prosodic feature used to distinguish words in many languages, some of which are endangered and scarcely documented. In this work, we use unsupervised representation learning to identify probable clusters of syllables that share the same phonemic tone. Our method extracts the pitch for each syllable, then trains a convolutional autoencoder to learn a low-dimensional representation for each contour. We then apply the mean shift algorithm to cluster tones in high-density regions of the latent space. Furthermore, by feeding the centers of each cluster into the decoder, we produce a prototypical contour that represents each cluster. We apply this method to spoken multi-syllable words in Mandarin Chinese and Cantonese and evaluate how closely our clusters match the ground truth tone categories. Finally, we discuss some difficulties with our approach, including contextual tone variation and allophony effects.
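
The clustering step mentioned above can be sketched as follows. This is a minimal flat-kernel mean shift over low-dimensional latent points, written under stated assumptions: the points, the bandwidth, and the helper names (`mean_shift`, `label_modes`) are hypothetical, and the autoencoder and decoder stages from the abstract are not reproduced here.

```python
import math

def mean_shift(points, bandwidth, iterations=50):
    """Move each point to the mean of the original points within `bandwidth`."""
    shifted = [list(p) for p in points]
    for _ in range(iterations):
        updated = []
        for p in shifted:
            neighbours = [q for q in points if math.dist(p, q) <= bandwidth]
            if not neighbours:
                # Defensive fallback: keep the point where it is.
                updated.append(p)
                continue
            updated.append([sum(c) / len(neighbours) for c in zip(*neighbours)])
        shifted = updated
    return shifted

def label_modes(shifted, tol):
    """Group converged points whose positions lie within `tol` of a known centre."""
    centers, labels = [], []
    for p in shifted:
        for i, c in enumerate(centers):
            if math.dist(p, c) <= tol:
                labels.append(i)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return centers, labels

# Hypothetical 2-D latent codes for five syllable contours.
latent = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0]]
shifted = mean_shift(latent, bandwidth=1.0)
centers, labels = label_modes(shifted, tol=0.5)
print(labels)
```

In the pipeline the abstract describes, each entry of `centers` would be the latent vector passed to the decoder to obtain a prototypical contour for that cluster.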

An information theoretic view on selecting linguistic probes
Zining Zhu | Frank Rudzicz
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

There is increasing interest in assessing the linguistic knowledge encoded in neural representations. A popular approach is to attach a diagnostic classifier – or “probe” – to perform supervised classification from internal representations. However, how to select a good probe is in debate. Hewitt and Liang (2019) showed that a high performance on diagnostic classification itself is insufficient, because it can be attributed to either “the representation being rich in knowledge”, or “the probe learning the task”, which Pimentel et al. (2020) challenged. We show this dichotomy is valid information-theoretically. In addition, we find that the “good probe” criteria proposed by the two papers, *selectivity* (Hewitt and Liang, 2019) and *information gain* (Pimentel et al., 2020), are equivalent – the errors of their approaches are identical (modulo irrelevant terms). Empirically, these two selection criteria lead to results that highly agree with each other.
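
As a small illustration of the selectivity criterion named above, the sketch below treats selectivity as the probe's accuracy on the linguistic task minus its accuracy on a control task, which is how the criterion is usually summarized. The labels and predictions are invented for the example and do not come from either cited paper.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def selectivity(task_accuracy, control_accuracy):
    """Linguistic-task accuracy minus control-task accuracy."""
    return task_accuracy - control_accuracy

# Hypothetical probe outputs on a part-of-speech task.
task_gold = ["N", "V", "N", "V"]
task_pred = ["N", "V", "N", "V"]

# Hypothetical probe outputs on a control task with arbitrary labels.
control_gold = [0, 1, 1, 0]
control_pred = [0, 1, 0, 1]

task_acc = accuracy(task_pred, task_gold)
control_acc = accuracy(control_pred, control_gold)
print(selectivity(task_acc, control_acc))
```

A probe that scores well on both tasks would have low selectivity, which suggests it may be learning the task itself rather than reading knowledge out of the representation.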

Word class flexibility: A deep contextualized approach
Bai Li | Guillaume Thomas | Yang Xu | Frank Rudzicz
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Word class flexibility refers to the phenomenon whereby a single word form is used across different grammatical categories. Extensive work in linguistic typology has sought to characterize word class flexibility across languages, but quantifying this phenomenon accurately and at scale has been fraught with difficulties. We propose a principled methodology to explore regularity in word class flexibility. Our method builds on recent work in contextualized word embeddings to quantify semantic shift between word classes (e.g., noun-to-verb, verb-to-noun), and we apply this method to 37 languages. We find that contextualized embeddings not only capture human judgment of class variation within words in English, but also uncover shared tendencies in class flexibility across languages. Specifically, we find greater semantic variation when flexible lemmas are used in their dominant word class, supporting the view that word class flexibility is a directional process. Our work highlights the utility of deep contextualized models in linguistic typology.

Explainable Clinical Decision Support from Text
Jinyue Feng | Chantal Shaib | Frank Rudzicz
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Clinical prediction models often use structured variables and provide outcomes that are not readily interpretable by clinicians. Further, free-text medical notes may contain information not immediately available in structured variables. We propose a hierarchical CNN-transformer model with explicit attention as an interpretable, multi-task clinical language model, which achieves an AUROC of 0.75 and 0.78 on sepsis and mortality prediction, respectively. We also explore the relationships between learned features from structured and unstructured variables using projection-weighted canonical correlation analysis. Finally, we outline a protocol to evaluate model usability in a clinical decision support context. From domain-expert evaluations, our model generates informative rationales that have promising real-life applications.

Identification of Primary and Collateral Tracks in Stuttered Speech
Rachid Riad | Anne-Catherine Bachoud-Lévi | Frank Rudzicz | Emmanuel Dupoux
Proceedings of the Twelfth Language Resources and Evaluation Conference

Disfluent speech has previously been addressed from two main perspectives: the clinical perspective, which focuses on diagnosis, and the Natural Language Processing (NLP) perspective, which aims to model these events and detect them for downstream tasks. In addition, previous works often used different metrics depending on whether the input features are text or speech, making it difficult to compare the different contributions. Here, we introduce a new evaluation framework for disfluency detection inspired by both the clinical and NLP perspectives, together with the theory of performance of Clark (1996), which distinguishes between primary and collateral tracks. We introduce a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews, and present baseline results directly comparing the performance of text-based features (word and span information) and speech-based features (acoustic-prosodic information). Finally, we introduce new audio features inspired by the word-based span features. We show experimentally that using these features outperforms the baselines for speech-based predictions on the present dataset.

Exploring Text Specific and Blackbox Fairness Algorithms in Multimodal Clinical NLP
John Chen | Ian Berlot-Attwell | Xindi Wang | Safwan Hossain | Frank Rudzicz
Proceedings of the 3rd Clinical Natural Language Processing Workshop

Clinical machine learning is increasingly multimodal, with data collected in both structured tabular formats and unstructured forms such as free text. We propose a novel task of exploring fairness on a multimodal clinical dataset, adopting equalized odds for the downstream medical prediction tasks. To this end, we investigate a modality-agnostic fairness algorithm, equalized odds post-processing, and compare it to a text-specific fairness algorithm: debiased clinical word embeddings. Although debiased word embeddings do not explicitly address equalized odds of protected groups, we show that a text-specific approach to fairness may simultaneously achieve a good balance of performance and classical notions of fairness. Our work opens the door for future work at the critical intersection of clinical NLP and fairness.
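
Equalized odds asks that true-positive and false-positive rates match across protected groups. The sketch below measures how far a set of binary predictions is from that condition. It is an illustrative check only, not the post-processing algorithm studied in the paper, and the labels, predictions, and group names are invented.

```python
def rates(y_true, y_pred):
    """Return (true-positive rate, false-positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

def equalized_odds_gaps(y_true, y_pred, groups):
    """Largest between-group difference in TPR and in FPR."""
    by_group = {}
    for t, p, g in zip(y_true, y_pred, groups):
        truths, preds = by_group.setdefault(g, ([], []))
        truths.append(t)
        preds.append(p)
    per_group = {g: rates(ts, ps) for g, (ts, ps) in by_group.items()}
    tprs = [r[0] for r in per_group.values()]
    fprs = [r[1] for r in per_group.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# Hypothetical outcomes for two groups, "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

tpr_gap, fpr_gap = equalized_odds_gaps(y_true, y_pred, groups)
print(tpr_gap, fpr_gap)
```

Both gaps would be zero for a classifier that satisfies equalized odds exactly on this sample.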

2019

Predicting ICU transfers using text messages between nurses and doctors
Faiza Khan Khattak | Chloé Pou-Prom | Robert Wu | Frank Rudzicz
Proceedings of the 2nd Clinical Natural Language Processing Workshop

We explore the use of real-time clinical information, i.e., text messages sent between nurses and doctors regarding patient conditions, to predict transfer to the intensive care unit (ICU). Preliminary results on data from five hospitals indicate that, despite being short and noisy, text messages can augment other visit information to improve the performance of ICU transfer prediction.

Generative Adversarial Networks for Text Using Word2vec Intermediaries
Akshay Budhkar | Krishnapriya Vishnubhotla | Safwan Hossain | Frank Rudzicz
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

Generative adversarial networks (GANs) have shown considerable success, especially in the realistic generation of images. In this work, we apply similar techniques for the generation of text. We propose a novel approach to handle the discrete nature of text, during training, using word embeddings. Our method is agnostic to vocabulary size and achieves competitive results relative to methods with various discrete gradient estimators.

Extracting relevant information from physician-patient dialogues for automated clinical note taking
Serena Jeblee | Faiza Khan Khattak | Noah Crampton | Muhammad Mamdani | Frank Rudzicz
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

We present a system for automatically extracting pertinent medical information from dialogues between clinicians and patients. The system parses each dialogue and extracts entities such as medications and symptoms, using context to predict which entities are relevant. We also classify the primary diagnosis for each conversation. In addition, we extract topic information and identify relevant utterances. This serves as a baseline for a system that extracts information from dialogues and automatically generates a patient note, which can be reviewed and edited by the clinician.

Lexical Features Are More Vulnerable, Syntactic Features Have More Predictive Power
Jekaterina Novikova | Aparna Balagopalan | Ksenia Shkaruta | Frank Rudzicz
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Understanding the vulnerability of linguistic features extracted from noisy text is important for both developing better health text classification models and for interpreting vulnerabilities of natural language models. In this paper, we investigate how generic language characteristics, such as syntax or the lexicon, are impacted by artificial text alterations. The vulnerability of features is analysed from two perspectives: (1) the level of feature value change, and (2) the level of change of feature predictive power as a result of text modifications. We show that lexical features are more sensitive to text modifications than syntactic ones. However, we also demonstrate that these smaller changes of syntactic features have a stronger influence on classification performance downstream, compared to the impact of changes to lexical features. Results are validated across three datasets representing different text-classification tasks, with different levels of lexical and syntactic complexity of both conversational and written language.

Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies
Heidi Christensen | Kristy Hollingshead | Emily Prud’hommeaux | Frank Rudzicz | Keith Vertanen
Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies

Detecting cognitive impairments by agreeing on interpretations of linguistic features
Zining Zhu | Jekaterina Novikova | Frank Rudzicz
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Linguistic features have shown promising applications for detecting various cognitive impairments. To improve detection accuracies, increasing the amount of data or the number of linguistic features have been two applicable approaches. However, acquiring additional clinical data can be expensive, and hand-crafting features is burdensome. In this paper, we take a third approach, proposing Consensus Networks (CNs), a framework to classify after reaching agreements between modalities. We divide linguistic features into non-overlapping subsets according to their modalities, and let neural networks learn low-dimensional representations that agree with each other. These representations are passed into a classifier network. All neural networks are optimized iteratively. In this paper, we also present two methods that improve the performance of CNs. We then present ablation studies to illustrate the effectiveness of modality division. To understand further what happens in CNs, we visualize the representations during training. Overall, using all of the 413 linguistic features, our models significantly outperform traditional classifiers, which are used by the state-of-the-art papers.

Multilingual prediction of Alzheimer’s disease through domain adaptation and concept-based language modelling
Kathleen C. Fraser | Nicklas Linz | Bai Li | Kristina Lundholm Fors | Frank Rudzicz | Alexandra König | Jan Alexandersson | Philippe Robert | Dimitrios Kokkinakis
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

There is growing evidence that changes in speech and language may be early markers of dementia, but much of the previous NLP work in this area has been limited by the size of the available datasets. Here, we compare several methods of domain adaptation to augment a small French dataset of picture descriptions (n = 57) with a much larger English dataset (n = 550), for the task of automatically distinguishing participants with dementia from controls. The first challenge is to identify a set of features that transfer across languages; in addition to previously used features based on information units, we introduce a new set of features to model the order in which information units are produced by dementia patients and controls. These concept-based language model features improve classification performance in both English and French separately, and the best result (AUC = 0.89) is achieved using the multilingual training set with a combination of information and language model features.

Augmenting word2vec with latent Dirichlet allocation within a clinical application
Akshay Budhkar | Frank Rudzicz
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper presents three hybrid models that directly combine latent Dirichlet allocation and word embedding for distinguishing between speakers with and without Alzheimer’s disease from transcripts of picture descriptions. Two of our models get F-scores over the current state-of-the-art using automatic methods on the DementiaBank dataset.

How do we feel when a robot dies? Emotions expressed on Twitter before and after hitchBOT’s destruction
Kathleen C. Fraser | Frauke Zeller | David Harris Smith | Saif Mohammad | Frank Rudzicz
Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In 2014, a chatty but immobile robot called hitchBOT set out to hitchhike across Canada. It similarly made its way across Germany and the Netherlands, and had begun a trip across the USA when it was destroyed by vandals. In this work, we analyze the emotions and sentiments associated with words in tweets posted before and after hitchBOT’s destruction to answer two questions: Were there any differences in the emotions expressed across the different countries visited by hitchBOT? And how did the public react to the demise of hitchBOT? Our analyses indicate that while there were few cross-cultural differences in sentiment towards hitchBOT, there was a significant negative emotional reaction to its destruction, suggesting that people had formed an emotional connection with hitchBOT and perceived its destruction as morally wrong. We discuss potential implications of anthropomorphism and emotional attachment to robots from the perspective of robot ethics.

Detecting dementia in Mandarin Chinese using transfer learning from a parallel corpus
Bai Li | Yi-Te Hsu | Frank Rudzicz
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Machine learning has shown promise for automatic detection of Alzheimer’s disease (AD) through speech; however, efforts are hampered by a scarcity of data, especially in languages other than English. We propose a method to learn a correspondence between independently engineered lexicosyntactic features in two languages, using a large parallel corpus of out-of-domain movie dialogue data. We apply it to dementia detection in Mandarin Chinese, and demonstrate that our method outperforms both unilingual and machine translation-based baselines. This appears to be the first study that transfers feature domains in detecting cognitive decline.

2018

Learning multiview embeddings for assessing dementia
Chloé Pou-Prom | Frank Rudzicz
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

As the incidence of Alzheimer’s Disease (AD) increases, early detection becomes crucial. Unfortunately, datasets for AD assessment are often sparse and incomplete. In this work, we leverage the multiview nature of a small AD dataset, DementiaBank, to learn an embedding that captures different modes of cognitive impairment. We apply generalized canonical correlation analysis (GCCA) to our dataset and demonstrate the added benefit of using multiview embeddings in two downstream tasks: identifying AD and predicting clinical scores. By including multiview embeddings, we obtain an F1 score of 0.82 in the classification task and a mean absolute error of 3.42 in the regression task. Furthermore, we show that multiview embeddings can be obtained from other datasets as well.

2017

Identifying and Avoiding Confusion in Dialogue with People with Alzheimer’s Disease
Hamidreza Chinaei | Leila Chan Currie | Andrew Danks | Hubert Lin | Tejas Mehta | Frank Rudzicz
Computational Linguistics, Volume 43, Issue 2 - June 2017

Alzheimer’s disease (AD) is an increasingly prevalent cognitive disorder in which memory, language, and executive function deteriorate, usually in that order. There is a growing need to support individuals with AD and other forms of dementia in their daily lives, and our goal is to do so through speech-based interaction. Given that 33% of conversations with people with middle-stage AD involve a breakdown in communication, it is vital that automated dialogue systems be able to identify those breakdowns and, if possible, avoid them. In this article, we discuss several linguistic features that are verbal indicators of confusion in AD (including vocabulary richness, parse tree structures, and acoustic cues) and apply several machine learning algorithms to identify dialogue-relevant confusion from speech with up to 82% accuracy. We also learn dialogue strategies to avoid confusion in the first place, which is accomplished using a partially observable Markov decision process and which obtains accuracies (up to 96.1%) that are significantly higher than several baselines. This work represents a major step towards automated dialogue systems for individuals with dementia.
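
A partially observable Markov decision process keeps a probability distribution (a belief) over hidden states and revises it after each action and observation. The sketch below shows the standard belief update for a two-state toy problem. The state names, action, observations, and all probabilities are hypothetical and are not taken from the article.

```python
def belief_update(belief, action, observation, transition, emission):
    """Bayes update: b'(s2) is proportional to O(o | s2, a) * sum_s T(s2 | s, a) * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        predicted = sum(transition[action][s][s2] * belief[s] for s in states)
        unnormalized[s2] = emission[action][s2][observation] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical model: is the listener following the dialogue or confused?
transition = {
    "ask": {
        "clear": {"clear": 0.9, "confused": 0.1},
        "confused": {"clear": 0.3, "confused": 0.7},
    }
}
emission = {
    "ask": {
        "clear": {"fluent": 0.8, "pause": 0.2},
        "confused": {"fluent": 0.3, "pause": 0.7},
    }
}

belief = {"clear": 0.5, "confused": 0.5}
new_belief = belief_update(belief, "ask", "pause", transition, emission)
print(new_belief)
```

A dialogue policy would then choose its next action from the updated belief, for example preferring a simpler prompt when the probability of confusion is high.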

Detecting Anxiety through Reddit
Judy Hanwen Shen | Frank Rudzicz
Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology — From Linguistic Signal to Clinical Reality

Previous investigations into detecting mental illnesses through social media have predominately focused on detecting depression through Twitter corpora. In this paper, we study anxiety disorders through personal narratives collected through the popular social media website, Reddit. We build a substantial data set of typical and anxiety-related posts, and we apply N-gram language modeling, vector embeddings, topic analysis, and emotional norms to generate features that accurately classify posts related to binary levels of anxiety. We achieve an accuracy of 91% with vector-space word embeddings, and an accuracy of 98% when combined with lexicon-based features.

2016

Detecting late-life depression in Alzheimer’s disease through analysis of speech and language
Kathleen C. Fraser | Frank Rudzicz | Graeme Hirst
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

Vector-space topic models for detecting Alzheimer’s disease
Maria Yancheva | Frank Rudzicz
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Using linguistic features longitudinally to predict clinical scores for Alzheimer’s disease and related dementias
Maria Yancheva | Kathleen Fraser | Frank Rudzicz
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

Remote Speech Technology for Speech Professionals - the CloudCAST initiative
Phil Green | Ricard Marxer | Stuart Cunningham | Heidi Christensen | Frank Rudzicz | Maria Yancheva | André Coy | Massimuliano Malavasi | Lorenzo Desideri
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Ercan Altinsoy | Heidi Christensen | Peter Ljunglöf | François Portet | Frank Rudzicz
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

Automatic dysfluency detection in dysarthric speech using deep belief networks
Stacey Oue | Ricard Marxer | Frank Rudzicz
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies

2014

Speech recognition in Alzheimer’s disease with personal assistive robots
Frank Rudzicz | Rosalie Wang | Momotaz Begum | Alex Mihailidis
Proceedings of the 5th Workshop on Speech and Language Processing for Assistive Technologies

Proceedings of the 5th Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Dimitra Anastasiou | Cui Jian | Ani Nenkova | Rupal Patel | Frank Rudzicz | Annalu Waller | Desislava Zhekova
Proceedings of the 5th Workshop on Speech and Language Processing for Assistive Technologies

2013

Automatic speech recognition in the diagnosis of primary progressive aphasia
Kathleen Fraser | Frank Rudzicz | Naida Graham | Elizabeth Rochon
Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies

Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Peter Ljunglöf | Kathleen F. McCoy | François Portet | Brian Roark | Frank Rudzicz | Michel Vacher
Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies

Automatic detection of deception in child-produced speech using syntactic complexity features
Maria Yancheva | Frank Rudzicz
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Communication strategies for a computerized caregiver for individuals with Alzheimer’s disease
Frank Rudzicz | Rozanne Wilson | Alex Mihailidis | Elizabeth Rochon | Carol Leonard
Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies

2011

Acoustic transformations to improve the intelligibility of dysarthric speech
Frank Rudzicz
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

2010

Towards a noisy-channel model of dysarthria in speech recognition
Frank Rudzicz
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

Correcting Errors in Speech Recognition with Articulatory Dynamics
Frank Rudzicz
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Proceedings of the NAACL HLT 2010 Student Research Workshop
Julia Hockenmaier | Diane Litman | Adriane Boyd | Mahesh Joshi | Frank Rudzicz
Proceedings of the NAACL HLT 2010 Student Research Workshop

2009

Summarizing multiple spoken documents: finding evidence from untranscribed audio
Xiaodan Zhu | Gerald Penn | Frank Rudzicz
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2006

Clavius: Bi-Directional Parsing for Generic Multimodal Interaction
Frank Rudzicz
Proceedings of the COLING/ACL 2006 Student Research Workshop

Co-authors

Jan Alexandersson 4

Maria Yancheva 4

Ian Berlot-Attwell 3

Heidi Christensen 3

Elham Dolatabadi 3

Faiza Khan Khattak 3

Jekaterina Novikova 3

Domenic Rosati 3

Hassan Sajjad 3

Guillaume Thomas 3

Krishnapriya Vishnubhotla 3

Akshay Budhkar 2

Robie Gonzales 2

Safwan Hossain 2

Peter Ljunglöf 2

Ricard Marxer 2

Alex Mihailidis 2

François Portet 2

Chloé Pou-Prom 2

Elizabeth Rochon 2

Sudipta Singha Roy 2

Mohamed Abdalla 1

Ercan Altinsoy 1

Dimitra Anastasiou 1

Kwesi P. Apponsah 1

Stéphane Aroca-Ouellette 1

Anne-Catherine Bachoud-Lévi 1

Aparna Balagopalan 1

Lukasz Bartoszcze 1

Momotaz Begum 1

Lindsay Bertrand 1

Hamidreza Chinaei 1

Noah Crampton 1

Stuart Cunningham 1

Leila Chan Currie 1

Lorenzo Desideri 1

Prathiba Dhanesh 1

Emmanuel Dupoux 1

Malikeh Ehghaghi 1

Kristina Lundholm Fors 1

Julia Hockenmaier 1

Kristy Hollingshead 1

Jimmy Xiangji Huang 1

Bukola Ishola 1

Serena Jeblee 1

Karthik Raja K. Bhaskar 1

Dimitrios Kokkinakis 1

Alexandra König 1

Md Tahmid Rahman Laskar 1

Carol Leonard 1

Massimuliano Malavasi 1

Muhammad Mamdani 1

Kathleen F. McCoy 1

Robert Mercer 1

Saif Mohammad 1

Waqar Muhammad 1

Jaswinder Narain 1

Jingcheng Niu 1

Stephen Obadinma 1

Kanishk Patel 1

Emily Prud’hommeaux 1

Aida Ramezani 1

Philippe Robert 1

Sean Robertson 1

François Roewer-Després 1

Tobias Sesterhenn 1

Soroosh Shahtalebi 1

Chantal Shaib 1

Judy Hanwen Shen 1

Ksenia Shkaruta 1

David Harris Smith 1

Behrad Taghibeyglou 1

Griffin Tanner 1

Michel Vacher 1

Keith Vertanen 1

Annalu Waller 1

Rozanne Wilson 1

Frauke Zeller 1

Sean X. Zhang 1

Desislava Zhekova 1

Venues