Steven Bedrick

2025

2024

2023

pdf bib
Computational Analysis of Backchannel Usage and Overlap Length in Autistic Children
Grace O. Lawley | Peter A. Heeman | Steven Bedrick
Proceedings of the First Workshop on Connecting Multiple Disciplines to AI Techniques in Interaction-centric Autism Research and Diagnosis (ICARD 2023)

pdf bib abs
Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT)
Robert C. Gale | Alexandra C. Salem | Gerasimos Fergadiotis | Steven Bedrick
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

Speech language pathologists rely on information spanning the layers of language, often drawing from multiple layers (e.g. phonology & semantics) at once. Recent innovations in large language models (LLMs) have been shown to build powerful representations for many complex language structures, especially syntax and semantics, unlocking the potential of large datasets through self-supervised learning techniques. However, these datasets are overwhelmingly orthographic, favoring writing systems like the English alphabet, a natural but phonetically imprecise choice. Meanwhile, LLM support for the international phonetic alphabet (IPA) ranges from poor to absent. Further, LLMs encode text at a word- or near-word level, and pre-training tasks have little to gain from phonetic/phonemic representations. In this paper, we introduce BORT, an LLM for mixed orthography/IPA meant to overcome these limitations. To this end, we extend the pre-training of an existing LLM with our own self-supervised pronunciation tasks. We then fine-tune for a clinical task that requires simultaneous phonological and semantic analysis. For an “easy” and “hard” version of these tasks, we show that fine-tuning from our models is more accurate by a relative 24% and 29%, and improved on character error rates by a relative 75% and 31%, respectively, than those starting from the original model.

pdf bib abs
A Statistical Approach for Quantifying Group Difference in Topic Distributions Using Clinical Discourse Samples
Grace O. Lawley | Peter A. Heeman | Jill K. Dolata | Eric Fombonne | Steven Bedrick
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Topic distribution matrices created by topic models are typically used for document classification or as features in a separate machine learning algorithm. Existing methods for evaluating these topic distributions include metrics such as coherence and perplexity; however, there is a lack of statistically grounded evaluation tools. We present a statistical method for investigating group differences in the document-topic distribution vectors created by Latent Dirichlet Allocation (LDA) that uses Aitchison geometry to transform the vectors, multivariate analysis of variance (MANOVA) to compare sample means, and partial eta squared to calculate effect size. Using a corpus of dialogues between Autistic and Typically Developing (TD) children and trained examiners, we found that the topic distributions of Autistic children differed from those of TD children when responding to questions about social difficulties (p = .0083, partial eta squared = .19). Furthermore, the examiners’ topic distributions differed between the Autistic and TD groups when discussing emotions (p = .0035, partial eta squared = .20), social difficulties (p < .001, partial eta squared = .30), and friends (p = .0224, partial eta squared = .17). These results support the use of topic modeling in studying clinically relevant features of social communication such as topic maintenance.

2022

pdf bib abs
The Post-Stroke Speech Transcription (PSST) Challenge
Robert C. Gale | Mikala Fleegle | Gerasimos Fergadiotis | Steven Bedrick
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference

We present the outcome of the Post-Stroke Speech Transcription (PSST) challenge. For the challenge, we prepared a new data resource of responses to two confrontation naming tests found in AphasiaBank, extracting audio and adding new phonemic transcripts for each response. The challenge consisted of two tasks. Task A asked challengers to build an automatic speech recognizer (ASR) for phonemic transcription of the PSST samples, evaluated in terms of phoneme error rate (PER) as well as a finer-grained metric derived from phonological feature theory, feature error rate (FER). The best model had a 9.9% FER / 20.0% PER, improving on our baseline by a relative 18% and 24%, respectively. Task B approximated a downstream assessment task, asking challengers to identify whether each recording contained a correctly pronounced target word. Challengers were unable to improve on the baseline algorithm; however, using this algorithm with the improved transcripts from Task A resulted in 92.8% accuracy / 0.921 F1, a relative improvement of 2.8% and 3.3%, respectively.

2021

pdf bib abs
Refocusing on Relevance: Personalization in NLG
Shiran Dudy | Steven Bedrick | Bonnie Webber
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Many NLG tasks such as summarization, dialogue response, or open domain question answering, focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user’s intent or context of work is not easily recoverable based solely on that source text– a scenario that we argue is more of the rule than the exception. In this work, we argue that NLG systems in general should place a much higher level of emphasis on making use of additional context, and suggest that relevance (as used in Information Retrieval) be thought of as a crucial tool for designing user-oriented text-generating tasks. We further discuss possible harms and hazards around such personalization, and argue that value-sensitive design represents a crucial path forward through these challenges.

2020

pdf bib abs
Are Some Words Worth More than Others?
Shiran Dudy | Steven Bedrick
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy only captures one aspect of a language model’s behavior, and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of the prediction attempts will occur with frequently-occurring types. A model’s performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull generated text being produced by a downstream consumer of a language model. To address this, we propose two new intrinsic evaluation measures within the framework of a simple word prediction task that are designed to give a more holistic picture of a language model’s performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.

bib abs
Long-Tail Predictions with Continuous-Output Language Models
Shiran Dudy | Steven Bedrick
Proceedings of the Fourth Widening Natural Language Processing Workshop

Neural language models typically employ a categorical approach to prediction and training, leading to well-known computational and numerical limitations. An under-explored alternative approach is to perform prediction directly against a continuous word embedding space, which according to recent research is more akin to how lexemes are represented in the brain. Choosing this method opens the door for for large-vocabulary, language models and enables substantially smaller and simpler computational complexities. In this research we explore a different important trait - the continuous output prediction models reach low-frequency vocabulary words which we show are often ignored by the categorical model. Such words are essential, as they can contribute to personalization and user vocabulary adaptation. In this work, we explore continuous-space language modeling in the context of a word prediction task over two different textual domains (newswire text and biomedical journal articles). We investigate both traditional and adversarial training approaches, and report results using several different embedding spaces and decoding mechanisms. We find that our continuous-prediction approach outperforms the standard categorical approach in terms of term diversity, in particular with rare words.

2019

pdf bib abs
We Need to Talk about Standard Splits
Kyle Gorman | Steven Bedrick
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

It is standard practice in speech & language technology to rank systems according to their performance on a test set held out for evaluation. However, few researchers apply statistical tests to determine whether differences in performance are likely to arise by chance, and few examine the stability of system ranking across multiple training-testing splits. We conduct replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018, each of which claimed state-of-the-art performance on a widely-used “standard split”. While we replicate results on the standard split, we fail to reliably reproduce some rankings when we repeat this analysis with randomly generated training-testing splits. We argue that randomly generated splits should be used in system evaluation.

pdf bib abs
Noisy Neural Language Modeling for Typing Prediction in BCI Communication
Rui Dong | David Smith | Shiran Dudy | Steven Bedrick
Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies

Language models have broad adoption in predictive typing tasks. When the typing history contains numerous errors, as in open-vocabulary predictive typing with brain-computer interface (BCI) systems, we observe significant performance degradation in both n-gram and recurrent neural network language models trained on clean text. In evaluations of ranking character predictions, training recurrent LMs on noisy text makes them much more robust to noisy histories, even when the error model is misspecified. We also propose an effective strategy for combining evidence from multiple ambiguous histories of BCI electroencephalogram measurements.

pdf bib abs
Classification of Semantic Paraphasias: Optimization of a Word Embedding Model
Katy McKinney-Bock | Steven Bedrick
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

In clinical assessment of people with aphasia, impairment in the ability to recall and produce words for objects (anomia) is assessed using a confrontation naming task, where a target stimulus is viewed and a corresponding label is spoken by the participant. Vector space word embedding models have had inital results in assessing semantic similarity of target-production pairs in order to automate scoring of this task; however, the resulting models are also highly dependent upon training parameters. To select an optimal family of models, we fit a beta regression model to the distribution of performance metrics on a set of 2,880 grid search models and evaluate the resultant first- and second-order effects to explore how parameterization affects model performance. Comparing to SimLex-999, we show that clinical data can be used in an evaluation task with comparable optimal parameter settings as standard NLP evaluation datasets.

2018

pdf bib abs
A Multi-Context Character Prediction Model for a Brain-Computer Interface
Shiran Dudy | Shaobin Xu | Steven Bedrick | David Smith
Proceedings of the Second Workshop on Subword/Character LEvel Models

Brain-computer interfaces and other augmentative and alternative communication devices introduce language-modeing challenges distinct from other character-entry methods. In particular, the acquired signal of the EEG (electroencephalogram) signal is noisier, which, in turn, makes the user intent harder to decipher. In order to adapt to this condition, we propose to maintain ambiguous history for every time step, and to employ, apart from the character language model, word information to produce a more robust prediction system. We present preliminary results that compare this proposed Online-Context Language Model (OCLM) to current algorithms that are used in this type of setting. Evaluation on both perplexity and predictive accuracy demonstrates promising results when dealing with ambiguous histories in order to provide to the front end a distribution of the next character the user might type.

pdf bib abs
Compositional Language Modeling for Icon-Based Augmentative and Alternative Communication
Shiran Dudy | Steven Bedrick
Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP

Icon-based communication systems are widely used in the field of Augmentative and Alternative Communication. Typically, icon-based systems have lagged behind word- and character-based systems in terms of predictive typing functionality, due to the challenges inherent to training icon-based language models. We propose a method for synthesizing training data for use in icon-based language models, and explore two different modeling strategies. We propose a method to generate language models for corpus-less symbol-set.

2017

pdf bib abs
Target word prediction and paraphasia classification in spoken discourse
Joel Adams | Steven Bedrick | Gerasimos Fergadiotis | Kyle Gorman | Jan van Santen
Proceedings of the 16th BioNLP Workshop

We present a system for automatically detecting and classifying phonologically anomalous productions in the speech of individuals with aphasia. Working from transcribed discourse samples, our system identifies neologisms, and uses a combination of string alignment and language models to produce a lattice of plausible words that the speaker may have intended to produce. We then score this lattice according to various features, and attempt to determine whether the anomalous production represented a phonemic error or a genuine neologism. This approach has the potential to be expanded to consider other types of paraphasic errors, and could be applied to a wide variety of screening and therapeutic applications.

2016

Privacy concerns have often served as an insurmountable barrier for the production of research and resources in clinical information retrieval (IR). We believe that both clinical IR research innovation and legitimate privacy concerns can be served by the creation of intra-institutional, fully protected resources. In this paper, we provide some principles and tools for IR resource-building in the unique problem setting of patient-level IR, following the tradition of the Cranfield paradigm.