Emily Prud’hommeaux

Also published as: Emily T. Prud’hommeaux, Emily Prud'hommeaux, Emily Prudhommeaux

2025

pdf bib abs
That doesn’t sound right: Evaluating speech transcription quality in field linguistics corpora
Eric Le Ferrand | Bo Jiang | Joshua Hartshorne | Emily Prud’hommeaux
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Incorporating automatic speech recognition (ASR) into field linguistics workflows for language documentation has become increasingly common. While ASR performance has seen improvements in low-resource settings, obstacles remain when training models on data collected by documentary linguists. One notable challenge lies in the way that this data is curated. ASR datasets built from spontaneous speech are typically recorded in consistent settings and transcribed by native speakers following a set of well designed guidelines. In contrast, field linguists collect data in whatever format it is delivered by their language consultants and transcribe it as best they can given their language skills and the quality of the recording. This approach to data curation, while valuable for linguistic research, does not always align with the standards required for training robust ASR models. In this paper, we explore methods for identifying speech transcriptions in fieldwork data that may be unsuitable for training ASR models. We focus on two complimentary automated measures of transcription quality that can be used to identify transcripts with characteristics that are common in field data but could be detrimental to ASR training. We show that one of the metrics is highly effective at retrieving these types of transcriptions. Additionally, we find that filtering datasets using this metric of transcription quality reduces WER both in controlled experiments using simulated fieldwork with artificially corrupted data and in real fieldwork corpora.

pdf bib abs
Integrating diverse corpora for training an endangered language machine translation system
Hunter Scheppat | Joshua Hartshorne | Dylan Leddy | Eric Le Ferrand | Emily Prudhommeaux
Proceedings of the Eight Workshop on the Use of Computational Methods in the Study of Endangered Languages

Machine translation (MT) can be a useful technology for language documentation and for promoting language use in endangered language communities. Few endangered languages, however, have an existing parallel corpus large enough to train a reasonable MT model. In this paper, we re-purpose a wide range of diverse data sources containing Amis, English, and Mandarin text to serve as parallel corpora for training MT systems for Amis, one of the Indigenous languages of Taiwan. To supplement the small amount of Amis-English data, we produce synthetic Amis-English data by using a high quality MT system to generate English translations for the Mandarin side of the Amis-Mandarin corpus. Using two popular neural MT systems, OpenNMT and NLLB, we train models to translate between English and Amis, and Mandarin and Amis. We find that including synthetic data is helpful only when translating to English. In addition, we observe that neither MT architecture is consistently superior to other and that performance seems to vary according to the direction of translation and the amount of data used. These results indicate that MT is possible for an under-resourced language even without a formally prepared parallel corpus, but multiple training methods should be explored to produce optimal results.

pdf bib abs
What data should I include in my POS tagging training set?
Zoey Liu | Masoud Jasbi | Christan Grant | Kenji Sagae | Emily Prud’hommeaux
Findings of the Association for Computational Linguistics: EMNLP 2025

Building an NLP training set for understudied languages, including Indigenous and endangered languages, often faces challenges due to varying degrees of resource limitations in the speaker communities. What are some reasonable approaches for training set construction in these cases? We address this question with POS tagging as the test case. Although many might consider POS tagging “a solved problem”, it remains a crucial task for descriptive linguistics and language documentation and requires laborious manual annotation. Drawing data from 12 language families, we compare in-context learning, active learning (AL), and random sampling. Our results suggest: (1) for communities whose language data can be ethically shared with an API, using only 1,000 randomly sampled tokens as prompt examples, the proprietary GPT-4.1-mini can deliver desirable performance (F1>0.83) on par with that from a training set of thousands of tokens in AL iterations; (2) in cases where communities prefer not to share data, 4,500-5,500 tokens selected from AL can yield reasonable results at a pace statistically significantly faster than random sampling, evidenced by growth curve modeling.

pdf bib abs
Faithful Transcription: Leveraging Bible Recordings to Improve ASR for Endangered Languages
Eric Le Ferrand | Cian Mohamed Bashar Hauser | Joshua Hartshorne | Emily Prud’hommeaux
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

While automatic speech recognition (ASR) now achieves human-level accuracy for a dozen or so languages, the majority of the world’s languages lack the resources needed to train robust ASR models. For many of these languages, the largest available source of transcribed speech data consists of recordings of the Bible. Bible recordings are appealingly large and well-structured resources, but they have notable limitations: the vocabulary and style are constrained, and the recordings are typically produced by a single speaker in a studio. These factors raise an important question: to what extent are Bible recordings useful for developing ASR models to transcribe contemporary naturalistic speech, the goal of most ASR applications? In this paper, we use Bible recordings alongside contemporary speech recordings to train ASR models in a selection of under-resourced and endangered languages. We find that models trained solely on Bible data yield shockingly weak performance when tested on contemporary everyday speech, even when compared to models trained on other (non-Bible) out-of-domain data. We identify one way of effectively leveraging Bible data in the ASR training pipeline via a two-stage training regime. Our results highlight the need to re-assess reported results relying exclusively on Bible data and to use Bible data carefully and judiciously.

2024

pdf bib abs
Enenlhet as a case-study to investigate ASR model generalizability for language documentation
Éric Le Ferrand | Raina Heaton | Emily Prud’hommeaux
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

Although both linguists and language community members recognize the potential utility of automatic speech recognition (ASR) for documentation, one of the obstacles to using these technologies is the scarcity of data necessary to train effective systems. Recent advances in ASR, particularly the ability to fine-tune large multilingual acoustic models to small amounts of data from a new language, have demonstrated the potential of ASR for transcription. However, many proof-of-concept demonstrations of ASR in low-resource settings rely on a single data collection project, which may yield models that are biased toward that particular data scenario, whether in content, recording quality, transcription conventions, or speaker population. In this paper, we investigate the performance of two state-of-the art ASR architectures for fine-tuning acoustic models to small speech datasets with the goal of transcribing recordings of Enenlhet, an endangered Indigenous language spoken in South America. Our results suggest that while ASR offers utility for generating first-pass transcriptions of speech collected in the course of linguistic fieldwork, individual vocabulary diversity and data quality have an outsized impact on ASR accuracy.

pdf bib
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024
Kyle Gorman | Emily Prud'hommeaux | Brian Roark | Richard Sproat
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

pdf bib abs
Automatic Transcription of Grammaticality Judgements for Language Documentation
Éric Le Ferrand | Emily Prud’hommeaux
Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages

Descriptive linguistics is a sub-field of linguistics that involves the collection and annotationof language resources to describe linguistic phenomena. The transcription of these resources is often described as a tedious task, and Automatic Speech Recognition (ASR) has frequently been employed to support this process. However, the typical research approach to ASR in documentary linguistics often only captures a subset of the field’s diverse reality. In this paper, we focus specifically on one type of data known as grammaticality judgment elicitation in the context of documenting Kréyòl Gwadloupéyen. We show that only a few minutes of speech is enough to fine-tune a model originally trained in French to transcribe segments in Kréyol.

pdf bib abs
Exploring the impact of noise in low-resource ASR for Tamil
Vigneshwar Lakshminarayanan | Emily Prud’hommeaux
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The use of deep learning algorithms has resulted in significant progress in automatic speech recognition (ASR). Robust high-accuracy ASR models typically require thousands or tens of thousands of hours of speech data, but even the strongest models tend fail under noisy conditions. Unsurprisingly, the impact of noise on accuracy is more drastic in low-resource settings. In this paper, we investigate the impact of noise on ASR in a low-resource setting. We explore novel methods for developing noise-robust ASR models using a a small dataset for Tamil, a widely-spoken but under-resourced Dravidian languages. We add various noises to the audio data to determine the impact of different kinds of noise (e.g., punctuated vs. constant, man-made vs natural) We also explore the relationship between different data augmentation methods are better suited to handling different types of noise. Our results show that all noises, regardless of the type, had an impact on ASR performance, and that upgrading the architecture alone could not mitigate the impact of noise. SpecAugment, the most common data augmentation method, was not as helpful as raw data augmentation, in which noise is explicitly added to the training data. Raw data augmentation enhances ASR performance on both clean data and noise-mixed data.

pdf bib abs
How Important is a Language Model for Low-resource ASR?
Zoey Liu | Nitin Venkateswaran | Eric Le Ferrand | Emily Prud’hommeaux
Findings of the Association for Computational Linguistics: ACL 2024

N-gram language models (LMs) are the innovation that first made large-vocabulary continuous automatic speech recognition (ASR) viable. With neural end-to-end ASR architectures, however, LMs have become an afterthought. While the effect on accuracy may be negligible for English and Mandarin, jettisoning the LM might not make sense for the world’s remaining 6000+ languages. In this paper, we investigate the role of the LM in low-resource ASR. First we ask: does using an n-gram LM in decoding in neural architectures help ASR performance? While it may seem obvious that it should, its absence in most implementations suggests otherwise. Second, we ask: when an n-gram LM is used in ASR, is there a relationship between the size of the LM and ASR accuracy? We have discovered that gut feelings on this question vary considerably, but there is little empirical work to support any particular claim. We explore these questions “in the wild” using a deliberately diverse set of 9 very small ASR corpora. The results show that: (1) decoding with an n-gram LM, regardless of its size, leads to lower word error rates; and (2) increasing the size of the LM appears to yield improvements only when the audio corpus itself is already relatively large. This suggests that collecting additional LM training text may benefit widely-spoken languages which typically have larger audio corpora. In contrast, for endangered languages where data of any kind will always be limited, efforts may be better spent collecting additional transcribed audio.

pdf bib abs
Are modern neural ASR architectures robust for polysynthetic languages?
Eric Le Ferrand | Zoey Liu | Antti Arppe | Emily Prud’hommeaux
Findings of the Association for Computational Linguistics: EMNLP 2024

Automatic speech recognition (ASR) technology is frequently proposed as a means of preservation and documentation of endangered languages, with promising results thus far. Among the endangered languages spoken today, a significant number exhibit complex morphology. The models employed in contemporary language documentation pipelines that utilize ASR, however, are predominantly based on isolating or inflectional languages, often from the Indo-European family. This raises a critical concern: building models exclusively on such languages may introduce a bias, resulting in better performance with simpler morphological structures. In this paper, we investigate the performance of modern ASR architectures on morphologically complex languages. Results indicate that modern ASR architectures appear less robust in managing high OOV rates for morphologically complex languages in terms of word error rate, while character error rates are consistently higher for isolating languages.

2023

pdf bib abs
An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language
Robert Jimerson | Zoey Liu | Emily Prud’hommeaux
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Advances in deep neural models for automatic speech recognition (ASR) have yielded dramatic improvements in ASR quality for resource-rich languages, with English ASR now achieving word error rates comparable to that of human transcribers. The vast majority of the world’s languages, however, lack the quantity of data necessary to approach this level of accuracy. In this paper we use four of the most popular ASR toolkits to train ASR models for eleven languages with limited ASR training resources: eleven widely spoken languages of Africa, Asia, and South America, one endangered language of Central America, and three critically endangered languages of North America. We find that no single architecture consistently outperforms any other. These differences in performance so far do not appear to be related to any particular feature of the datasets or characteristics of the languages. These findings have important implications for future research in ASR for under-resourced languages. ASR systems for languages with abundant existing media and available speakers may derive the most benefit simply by collecting large amounts of additional acoustic and textual training data. Communities using ASR to support endangered language documentation efforts, who cannot easily collect more data, might instead focus on exploring multiple architectures and hyperparameterizations to optimize performance within the constraints of their available data and resources.

pdf bib
Studying the impact of language model size for low-resource ASR
Zoey Liu | Justin Spence | Emily Prud’hommeaux
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

pdf bib abs
Investigating data partitioning strategies for crosslinguistic low-resource ASR evaluation
Zoey Liu | Justin Spence | Emily Prud’hommeaux
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Many automatic speech recognition (ASR) data sets include a single pre-defined test set consisting of one or more speakers whose speech never appears in the training set. This “hold-speaker(s)-out” data partitioning strategy, however, may not be ideal for data sets in which the number of speakers is very small. This study investigates ten different data split methods for five languages with minimal ASR training resources. We find that (1) model performance varies greatly depending on which speaker is selected for testing; (2) the average word error rate (WER) across all held-out speakers is comparable not only to the average WER over multiple random splits but also to any given individual random split; (3) WER is also generally comparable when the data is split heuristically or adversarially; (4) utterance duration and intensity are comparatively more predictive factors of variability regardless of the data split. These results suggest that the widely used hold-speakers-out approach to ASR data partitioning can yield results that do not reflect model performance on unseen data or speakers. Random splits can yield more reliable and generalizable estimates when facing data sparsity.

pdf bib abs
Data-driven Parsing Evaluation for Child-Parent Interactions
Zoey Liu | Emily Prud’hommeaux
Transactions of the Association for Computational Linguistics, Volume 11

We present a syntactic dependency treebank for naturalistic child and child-directed spoken English. Our annotations largely follow the guidelines of the Universal Dependencies project (UD [Zeman et al., 2022]), with detailed extensions to lexical and syntactic structures unique to spontaneous spoken language, as opposed to written texts or prepared speech. Compared to existing UD-style spoken treebanks and other dependency corpora of child-parent interactions specifically, our dataset is much larger (44,744 utterances; 233,907 words) and contains data from 10 children covering a wide age range (18–66 months). We conduct thorough dependency parser evaluations using both graph-based and transition-based parsers, trained on three different types of out-of-domain written texts: news, tweets, and learner data. Out-of-domain parsers demonstrate reasonable performance for both child and parent data. In addition, parser performance for child data increases along children’s developmental paths, especially between 18 and 48 months, and gradually approaches the performance for parent data. These results are further validated with in-domain training.

2022

pdf bib abs
Not always about you: Prioritizing community needs when developing endangered language technology
Zoey Liu | Crystal Richardson | Richard Hatcher | Emily Prud’hommeaux
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Languages are classified as low-resource when they lack the quantity of data necessary for training statistical and machine learning tools and models. Causes of resource scarcity vary but can include poor access to technology for developing these resources, a relatively small population of speakers, or a lack of urgency for collecting such resources in bilingual populations where the second language is high-resource. As a result, the languages described as low-resource in the literature are as different as Finnish on the one hand, with millions of speakers using it in every imaginable domain, and Seneca, with only a small-handful of fluent speakers using the language primarily in a restricted domain. While issues stemming from the lack of resources necessary to train models unite this disparate group of languages, many other issues cut across the divide between widely-spoken low-resource languages and endangered languages. In this position paper, we discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face when working together to develop language technology to support endangered language documentation and revitalization. We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics. We describe an ongoing fruitful collaboration and make recommendations for future partnerships between academic researchers and language community stakeholders.

pdf bib abs
Evaluating the Performance of Transformer-based Language Models for Neuroatypical Language
Duanchen Liu | Zoey Liu | Qingyun Yang | Yujing Huang | Emily Prud’hommeaux
Proceedings of the 29th International Conference on Computational Linguistics

Difficulties with social aspects of language are among the hallmarks of autism spectrum disorder (ASD). These communication differences are thought to contribute to the challenges that adults with ASD experience when seeking employment, underscoring the need for interventions that focus on improving areas of weakness in pragmatic and social language. In this paper, we describe a transformer-based framework for identifying linguistic features associated with social aspects of communication using a corpus of conversations between adults with and without ASD and neurotypical conversational partners produced while engaging in collaborative tasks. While our framework yields strong accuracy overall, performance is significantly worse for the language of participants with ASD, suggesting that they use a more diverse set of strategies for some social linguistic functions. These results, while showing promise for the development of automated language analysis tools to support targeted language interventions for ASD, also reveal weaknesses in the ability of large contextualized language models to model neuroatypical language.

pdf bib abs
Enhancing Documentation of Hupa with Automatic Speech Recognition
Zoey Liu | Justin Spence | Emily Prud’hommeaux
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

This study investigates applications of automatic speech recognition (ASR) techniques to Hupa, a critically endangered Native American language from the Dene (Athabaskan) language family. Using around 9h12m of spoken data produced by one elder who is a first-language Hupa speaker, we experimented with different evaluation schemes and training settings. On average a fully connected deep neural network reached a word error rate of 35.26%. Our overall results illustrate the utility of ASR for making Hupa language documentation more accessible and usable. In addition, we found that when training acoustic models, using recordings with transcripts that were not carefully verified did not necessarily have a negative effect on model performance. This shows promise for speech corpora of indigenous languages that commonly include transcriptions produced by second-language speakers or linguists who have advanced knowledge in the language of interest.

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

pdf bib
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)
Sarah Ebling | Emily Prud’hommeaux | Preethi Vaidyanathan
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)

pdf bib abs
Data-driven Model Generalizability in Crosslinguistic Low-resource Morphological Segmentation
Zoey Liu | Emily Prud’hommeaux
Transactions of the Association for Computational Linguistics, Volume 10

Common designs of model evaluation typically focus on monolingual settings, where different models are compared according to their performance on a single data set that is assumed to be representative of all possible data for the task at hand. While this may be reasonable for a large data set, this assumption is difficult to maintain in low-resource scenarios, where artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental. To address these concerns, we investigate model generalizability in crosslinguistic low-resource scenarios. Using morphological segmentation as the test case, we compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families. In each experimental setting, we evaluate all models on a first data set, then examine their performance consistency when introducing new randomly sampled data sets with the same size and when applying the trained models to unseen test sets of varying sizes. The results demonstrate that the extent of model generalization depends on the characteristics of the data set, and does not necessarily rely heavily on the data set size. Among the characteristics that we studied, the ratio of morpheme overlap and that of the average number of morphemes per word between the training and test sets are the two most prominent factors. Our findings suggest that future work should adopt random sampling to construct data sets with different sizes in order to make more responsible claims about model evaluation.

2021

pdf bib abs
Predicting pragmatic discourse features in the language of adults with autism spectrum disorder
Christine Yang | Duanchen Liu | Qingyun Yang | Zoey Liu | Emily Prud’hommeaux
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Individuals with autism spectrum disorder (ASD) experience difficulties in social aspects of communication, but the linguistic characteristics associated with deficits in discourse and pragmatic expression are often difficult to precisely identify and quantify. We are currently collecting a corpus of transcribed natural conversations produced in an experimental setting in which participants with and without ASD complete a number of collaborative tasks with their neurotypical peers. Using this dyadic conversational data, we investigate three pragmatic features – politeness, uncertainty, and informativeness – and present a dataset of utterances annotated for each of these features on a three-point scale. We then introduce ongoing work in developing and training neural models to automatically predict these features, with the goal of identifying the same between-groups differences that are observed using manual annotations. We find the best performing model for all three features is a feed-forward neural network trained with BERT embeddings. Our models yield higher accuracy than ones used in previous approaches for deriving these features, with F1 exceeding 0.82 for all three pragmatic features.

pdf bib abs
Dependency Parsing Evaluation for Low-resource Spontaneous Speech
Zoey Liu | Emily Prud’hommeaux
Proceedings of the Second Workshop on Domain Adaptation for NLP

How well can a state-of-the-art parsing system, developed for the written domain, perform when applied to spontaneous speech data involving different interlocutors? This study addresses this question in a low-resource setting using child-parent conversations from the CHILDES databse. Specifically, we focus on dependency parsing evaluation for utterances of one specific child (18 - 27 months) and her parents. We first present a semi-automatic adaption of the dependency annotation scheme in CHILDES to that of the Universal Dependencies project, an annotation style that is more commonly applied in dependency parsing. Our evaluation demonstrates that an outof-domain biaffine parser trained only on written texts performs well with parent speech. There is, however, much room for improvement on child utterances, particularly at 18 and 21 months, due to cases of omission and repetition that are prevalent in child speech. By contrast, parsers trained or fine-tuned with in-domain spoken data on a much smaller scale can achieve comparable results for parent speech and improve the weak parsing performance for child speech at these earlier ages

pdf bib abs
Morphological Segmentation for Seneca
Zoey Liu | Robert Jimerson | Emily Prud’hommeaux
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This study takes up the task of low-resource morphological segmentation for Seneca, a critically endangered and morphologically complex Native American language primarily spoken in what is now New York State and Ontario. The labeled data in our experiments comes from two sources: one digitized from a publicly available grammar book and the other collected from informal sources. We treat these two sources as distinct domains and investigate different evaluation designs for model selection. The first design abides by standard practices and evaluate models with the in-domain development set, while the second one carries out evaluation using a development domain, or the out-of-domain development set. Across a series of monolingual and crosslinguistic training settings, our results demonstrate the utility of neural encoder-decoder architecture when coupled with multi-task learning.

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

2020

pdf bib abs
Automated Scoring of Clinical Expressive Language Evaluation Tasks
Yiyi Wang | Emily Prud’hommeaux | Meysam Asgari | Jill Dolata
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

Many clinical assessment instruments used to diagnose language impairments in children include a task in which the subject must formulate a sentence to describe an image using a specific target word. Because producing sentences in this way requires the speaker to integrate syntactic and semantic knowledge in a complex manner, responses are typically evaluated on several different dimensions of appropriateness yielding a single composite score for each response. In this paper, we present a dataset consisting of non-clinically elicited responses for three related sentence formulation tasks, and we propose an approach for automatically evaluating their appropriateness. We use neural machine translation to generate correct-incorrect sentence pairs in order to create synthetic data to increase the amount and diversity of training data for our scoring model. Our scoring model uses transfer learning to facilitate automatic sentence appropriateness evaluation. We further compare custom word embeddings with pre-trained contextualized embeddings serving as features for our scoring model. We find that transfer learning improves scoring accuracy, particularly when using pretrained contextualized embeddings.

pdf bib abs
Fully Convolutional ASR for Less-Resourced Endangered Languages
Bao Thai | Robert Jimerson | Raymond Ptucha | Emily Prud’hommeaux
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

The application of deep learning to automatic speech recognition (ASR) has yielded dramatic accuracy increases for languages with abundant training data, but languages with limited training resources have yet to see accuracy improvements on this scale. In this paper, we compare a fully convolutional approach for acoustic modelling in ASR with a variety of established acoustic modeling approaches. We evaluate our method on Seneca, a low-resource endangered language spoken in North America. Our method yields word error rates up to 40% lower than those reported using both standard GMM-HMM approaches and established deep neural methods, with a substantial reduction in training time. These results show particular promise for languages like Seneca that are both endangered and lack extensive documentation.

Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

2019

pdf bib
Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies
Heidi Christensen | Kristy Hollingshead | Emily Prud’hommeaux | Frank Rudzicz | Keith Vertanen
Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies

2018

pdf bib
ASR for Documenting Acutely Under-Resourced Indigenous Languages
Robbie Jimerson | Emily Prud’hommeaux
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
A dataset for identifying actionable feedback in collaborative software development
Benjamin S. Meyers | Nuthan Munaiah | Emily Prud’hommeaux | Andrew Meneely | Cecilia Ovesdotter Alm | Josephine Wolff | Pradeep K. Murukannaiah
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Software developers and testers have long struggled with how to elicit proactive responses from their coworkers when reviewing code for security vulnerabilities and errors. For a code review to be successful, it must not only identify potential problems but also elicit an active response from the colleague responsible for modifying the code. To understand the factors that contribute to this outcome, we analyze a novel dataset of more than one million code reviews for the Google Chromium project, from which we extract linguistic features of feedback that elicited responsive actions from coworkers. Using a manually-labeled subset of reviewer comments, we trained a highly accurate classifier to identify acted-upon comments (AUC = 0.85). Our results demonstrate the utility of our dataset, the feasibility of using NLP for this new task, and the potential of NLP to improve our understanding of how communications between colleagues can be authored to elicit positive, proactive responses.

pdf bib abs
SNAG: Spoken Narratives and Gaze Dataset
Preethi Vaidyanathan | Emily T. Prud’hommeaux | Jeff B. Pelz | Cecilia O. Alm
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Humans rely on multiple sensory modalities when examining and reasoning over images. In this paper, we describe a new multimodal dataset that consists of gaze measurements and spoken descriptions collected in parallel during an image inspection task. The task was performed by multiple participants on 100 general-domain images showing everyday objects and activities. We demonstrate the usefulness of the dataset by applying an existing visual-linguistic data fusion framework in order to label important image regions with appropriate linguistic labels.

pdf bib
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic
Kate Loveys | Kate Niederhoffer | Emily Prud’hommeaux | Rebecca Resnik | Philip Resnik
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

2017

pdf bib abs
An Analysis and Visualization Tool for Case Study Learning of Linguistic Concepts
Cecilia Ovesdotter Alm | Benjamin Meyers | Emily Prud’hommeaux
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present an educational tool that integrates computational linguistics resources for use in non-technical undergraduate language science courses. By using the tool in conjunction with evidence-driven pedagogical case studies, we strive to provide opportunities for students to gain an understanding of linguistic concepts and analysis through the lens of realistic problems in feasible ways. Case studies tend to be used in legal, business, and health education contexts, but less in the teaching and learning of linguistics. The approach introduced also has potential to encourage students across training backgrounds to continue on to computational language analysis coursework.

pdf bib abs
Vector space models for evaluating semantic fluency in autism
Emily Prud’hommeaux | Jan van Santen | Douglas Gliner
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A common test administered during neurological examination is the semantic fluency test, in which the patient must list as many examples of a given semantic category as possible under timed conditions. Poor performance is associated with neurological conditions characterized by impairments in executive function, such as dementia, schizophrenia, and autism spectrum disorder (ASD). Methods for analyzing semantic fluency responses at the level of detail necessary to uncover these differences have typically relied on subjective manual annotation. In this paper, we explore automated approaches for scoring semantic fluency responses that leverage ontological resources and distributional semantic models to characterize the semantic fluency responses produced by young children with and without ASD. Using these methods, we find significant differences in the semantic fluency responses of children with ASD, demonstrating the utility of using objective methods for clinical language analysis.

2016

pdf bib abs
Analyzing Gender Bias in Student Evaluations
Andamlak Terkik | Emily Prud’hommeaux | Cecilia Ovesdotter Alm | Christopher Homan | Scott Franklin
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

University students in the United States are routinely asked to provide feedback on the quality of the instruction they have received. Such feedback is widely used by university administrators to evaluate teaching ability, despite growing evidence that students assign lower numerical scores to women and people of color, regardless of the actual quality of instruction. In this paper, we analyze students’ written comments on faculty evaluation forms spanning eight years and five STEM disciplines in order to determine whether open-ended comments reflect these same biases. First, we apply sentiment analysis techniques to the corpus of comments to determine the overall affect of each comment. We then use this information, in combination with other features, to explore whether there is bias in how students describe their instructors. We show that while the gender of the evaluated instructor does not seem to affect students’ expressed level of overall satisfaction with their instruction, it does strongly influence the language that they use to describe their instructors and their experience in class.