Casey Kennington

Also published as: Casey Redd Kennington

2025

pdf bib abs
Learning to Speak Like a Child: Reinforcing and Evaluating a Child-level Generative Language Model
Enoch Levandovsky | Anna Manaseryan | Casey Kennington
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

A language model that can generate utterances that are appraised as being within a specific age of a young child who is beginning their language learning journey can be useful in scenarios where child-level language is needed, for example in virtual avatars, interactions with individuals who have disabilities, or developmental robotics. In this paper, we focus on an age range that is not represented in prior work: emergent speakers. We use the CHILDES database to train and tune language models of different parameter sizes using a group relative policy optimization reinforcement learning regime. Our goal is to find the most coherent, yet child-like language model while keeping the number of parameters to as few as possible. We evaluate using metrics of coherency, “toddlerality,” and an evaluation using human subjects who interact with two robot platforms. Our experiments show that even small language models (under 1 billion parameters) can be used effectively to generate child-like utterances.

pdf bib abs
rrSDS 2.0: Incremental, Modular, Distributed, Multimodal Spoken Dialogue with Robotic Platforms
Anna Manaseryan | Porter Rigby | Brooke Matthews | Catherine Henry | Josue Torres-Fonseca | Ryan Whetten | Enoch Levandovsky | Casey Kennington
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

This demo will showcase updates made to the ‘robot-ready spoken dialogue system’ built on the Retico framework. Updates include new modules, logging and real-time monitoring tools, integrations with the Coppelia Sim virtual robot platfrom, integrations with a benchmark, improved documentation, and pypi environment usage.

2024

pdf bib abs
Conceptual Pacts for Reference Resolution Using Small, Dynamically Constructed Language Models: A Study in Puzzle Building Dialogues
Julian Hough | Sina Zarrieß | Casey Kennington | David Schlangen | Massimo Poesio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Using Brennan and Clark’s theory of a Conceptual Pact, that when interlocutors agree on a name for an object, they are forming a temporary agreement on how to conceptualize that object, we present an extension to a simple reference resolver which simulates this process over time with different conversation pairs. In a puzzle construction domain, we model pacts with small language models for each referent which update during the interaction. When features from these pact models are incorporated into a simple bag-of-words reference resolver, the accuracy increases compared to using a standard pre-trained model. The model performs equally to a competitor using the same data but with exhaustive re-training after each prediction, while also being more transparent, faster and less resource-intensive. We also experiment with reducing the number of training interactions, and can still achieve reference resolution accuracies of over 80% in testing from observing a single previous interaction, over 20% higher than a pre-trained baseline. While this is a limited domain, we argue the model could be applicable to larger real-world applications in human and human-robot interaction and is an interpretable and transparent model.

pdf bib abs
Incorporating Word-level Phonemic Decoding into Readability Assessment
Christine Pinney | Casey Kennington | Maria Soledad Pera | Katherine Landau Wright | Jerry Alan Fails
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Current approaches in automatic readability assessment have found success with the use of large language models and transformer architectures. These techniques lead to accuracy improvement, but they do not offer the interpretability that is uniquely required by the audience most often employing readability assessment tools: teachers and educators. Recent work that employs more traditional machine learning methods has highlighted the linguistic importance of considering semantic and syntactic characteristics of text in readability assessment by utilizing handcrafted feature sets. Research in Education suggests that, in addition to semantics and syntax, phonetic and orthographic instruction are necessary for children to progress through the stages of reading and spelling development; children must first learn to decode the letters and symbols on a page to recognize words and phonemes and their connection to speech sounds. Here, we incorporate this word-level phonemic decoding process into readability assessment by crafting a phonetically-based feature set for grade-level classification for English. Our resulting feature set shows comparable performance to much larger, semantically- and syntactically-based feature sets, supporting the linguistic value of orthographic and phonetic considerations in readability assessment.

pdf bib abs
Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning
Jun Zhuang | Casey Kennington
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms to classify papers within the taxonomy. Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models in two paradigms; pre-trained language models’ fine-tuning and zero-shot/few-shot classifications using LLMs. We find that our model surpasses an average human recognition level and that fine-tuning LLMs using weak labels generated by a smaller model, such as the GCN in this study, can be more effective than using ground-truth labels, revealing the potential of weak-to-strong generalization in the taxonomy classification task.

2023

pdf bib abs
Evaluating and Improving Automatic Speech Recognition using Severity
Ryan Whetten | Casey Kennington
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

A common metric for evaluating Automatic Speech Recognition (ASR) is Word Error Rate (WER) which solely takes into account discrepancies at the word-level. Although useful, WER is not guaranteed to correlate well with human judgment or performance on downstream tasks that use ASR. Meaningful assessment of ASR mistakes becomes even more important in high-stake scenarios such as health-care. We propose 2 general measures to evaluate the severity of mistakes made by ASR systems, one based on sentiment analysis and another based on text embeddings. We evaluate these measures on simulated patient-doctor conversations using 5 ASR systems. Results show that these measures capture characteristics of ASR errors that WER does not. Furthermore, we train an ASR system incorporating severity and demonstrate the potential for using severity not only in the evaluation, but in the development of ASR. Advantages and limitations of this methodology are analyzed and discussed.

pdf bib abs
Exploring Transformers as Compact, Data-efficient Language Models
Clayton Fields | Casey Kennington
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Large scale transformer models, trained with massive datasets have become the standard in natural language processing. The huge size of most transformers make research with these models impossible for those with limited computational resources. Additionally, the enormous pretraining data requirements of transformers exclude pretraining them with many smaller datasets that might provide enlightening results. In this study, we show that transformers can be significantly reduced in size, with as few as 5.7 million parameters, and still retain most of their downstream capability. Further we show that transformer models can retain comparable results when trained on human-scale datasets, as few as 5 million words of pretraining data. Overall, the results of our study suggest transformers function well as compact, data efficient language models and that complex model compression methods, such as model distillation are not necessarily superior to pretraining reduced size transformer models from scratch.

pdf bib
Tiny Language Models Enriched with Multimodal Knowledge from Multiplex Networks
Clayton Fields | Osama Natouf | Andrew McMains | Catherine Henry | Casey Kennington
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

pdf bib
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Svetlana Stoyanchev | Shafiq Joty | David Schlangen | Ondrej Dusek | Casey Kennington | Malihe Alikhani
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2022

pdf bib abs
HADREB: Human Appraisals and (English) Descriptions of Robot Emotional Behaviors
Josue Torres-Fonseca | Casey Kennington
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Humans sometimes anthropomorphize everyday objects, but especially robots that have human-like qualities and that are often able to interact with and respond to humans in ways that other objects cannot. Humans especially attribute emotion to robot behaviors, partly because humans often use and interpret emotions when interacting with other humans, and they apply that capability when interacting with robots. Moreover, emotions are a fundamental part of the human language system and emotions are used as scaffolding for language learning, making them an integral part of language learning and meaning. However, there are very few datasets that explore how humans perceive the emotional states of robots and how emotional behaviors relate to human language. To address this gap we have collected HADREB, a dataset of human appraisals and English descriptions of robot emotional behaviors collected from over 30 participants. These descriptions and human emotion appraisals are collected using the Mistyrobotics Misty II and the Digital Dream Labs Cozmo (formerly Anki) robots. The dataset contains English descriptions and emotion appraisals of more than 500 descriptions and graded valence labels of 8 emotion pairs for each behavior and each robot. In this paper we describe the process of collecting and cleaning the data, give a general analysis of the data, and evaluate the usefulness of the dataset in two experiments, one using a language model to map descriptions to emotions, the other maps robot behaviors to emotions.

pdf bib abs
Symbol and Communicative Grounding through Object Permanence with a Mobile Robot
Josue Torres-Fonseca | Catherine Henry | Casey Kennington
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Object permanence is the ability to form and recall mental representations of objects even when they are not in view. Despite being a crucial developmental step for children, object permanence has had only some exploration as it relates to symbol and communicative grounding in spoken dialogue systems. In this paper, we leverage SLAM as a module for tracking object permanence and use a robot platform to move around a scene where it discovers objects and learns how they are denoted. We evaluated by comparing our system’s effectiveness at learning words from human dialogue partners both with and without object permanence. We found that with object permanence, human dialogue partners spoke with the robot and the robot correctly identified objects it had learned about significantly more than without object permanence, which suggests that object permanence helped facilitate communicative and symbol grounding.

2021

pdf bib abs
Enriching Language Models with Visually-grounded Word Vectors and the Lancaster Sensorimotor Norms
Casey Kennington
Proceedings of the 25th Conference on Computational Natural Language Learning

Language models are trained only on text despite the fact that humans learn their first language in a highly interactive and multimodal environment where the first set of learned words are largely concrete, denoting physical entities and embodied states. To enrich language models with some of this missing experience, we leverage two sources of information: (1) the Lancaster Sensorimotor norms, which provide ratings (means and standard deviations) for over 40,000 English words along several dimensions of embodiment, and which capture the extent to which something is experienced across 11 different sensory modalities, and (2) vectors from coefficients of binary classifiers trained on images for the BERT vocabulary. We pre-trained the ELECTRA model and fine-tuned the RoBERTa model with these two sources of information then evaluate using the established GLUE benchmark and the Visual Dialog benchmark. We find that enriching language models with the Lancaster norms and image vectors improves results in both tasks, with some implications for robust language models that capture holistic linguistic meaning in a language learning context.

pdf bib abs
Spellchecking for Children in Web Search: a Natural Language Interface Case-study
Casey Kennington | Jerry Alan Fails | Katherine Landau Wright | Maria Soledad Pera
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

Given the more widespread nature of natural language interfaces, it is increasingly important to understand who are accessing those interfaces, and how those interfaces are being used. In this paper, we explore spellchecking in the context of web search with children as the target audience. In particular, via a literature review we show that, while widely used, popular search tools are ill-designed for children. We then use spellcheckers as a case study to highlight the need for an interdisciplinary approach that brings together natural language processing, education, human-computer interaction to address a known information retrieval problem: query misspelling. We conclude that it is imperative that those for whom the interfaces are designed have a voice in the design process.

pdf bib abs
Incremental Unit Networks for Multimodal, Fine-grained Information State Representation
Casey Kennington | David Schlangen
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

We offer a fine-grained information state annotation scheme that follows directly from the Incremental Unit abstract model of dialogue processing when used within a multimodal, co-located, interactive setting. We explain the Incremental Unit model and give an example application using the Localized Narratives dataset, then offer avenues for future research.

pdf bib abs
Natural Language Processing for Computer Scientists and Data Scientists at a Large State University
Casey Kennington
Proceedings of the Fifth Workshop on Teaching NLP

The field of Natural Language Processing (NLP) changes rapidly, requiring course offerings to adjust with those changes, and NLP is not just for computer scientists; it’s a field that should be accessible to anyone who has a sufficient background. In this paper, I explain how students with Computer Science and Data Science backgrounds can be well-prepared for an upper-division NLP course at a large state university. The course covers probability and information theory, elementary linguistics, machine and deep learning, with an attempt to balance theoretical ideas and concepts with practical applications. I explain the course objectives, topics and assignments, reflect on adjustments to the course over the last four years, as well as feedback from students.

2020

pdf bib abs
Evaluating and Improving Child-Directed Automatic Speech Recognition
Eric Booth | Jake Carns | Casey Kennington | Nader Rafla
Proceedings of the Twelfth Language Resources and Evaluation Conference

Speech recognition has seen dramatic improvements in the last decade, though those improvements have focused primarily on adult speech. In this paper, we assess child-directed speech recognition and leverage a transfer learning approach to improve child-directed speech recognition by training the recent DeepSpeech2 model on adult data, then apply additional tuning to varied amounts of child speech data. We evaluate our model using the CMU Kids dataset as well as our own recordings of child-directed prompts. The results from our experiment show that even a small amount of child audio data improves significantly over a baseline of adult-only or child-only trained models. We report a final general Word-Error-Rate of 29% over a baseline of 62% that uses the adult-trained model. Our analyses show that our model adapts quickly using a small amount of data and that the general child model works better than school grade-specific models. We make available our trained model and our data collection tool.

For help with their spelling errors, children often turn to spellcheckers integrated in software applications like word processors and search engines. However, existing spellcheckers are usually tuned to the needs of traditional users (i.e., adults) and generally prove unsatisfactory for children. Motivated by this issue, we introduce KidSpell, an English spellchecker oriented to the spelling needs of children. KidSpell applies (i) an encoding strategy for mapping both misspelled words and spelling suggestions to their phonetic keys and (ii) a selection process that prioritizes candidate spelling suggestions that closely align with the misspelled word based on their respective keys. To assess the effectiveness of, we compare the model’s performance against several popular, mainstream spellcheckers in a number of offline experiments using existing and novel datasets. The results of these experiments show that KidSpell outperforms existing spellcheckers, as it accurately prioritizes relevant spelling corrections when handling misspellings generated by children in both essay writing and online search tasks. As a byproduct of our study, we create two new datasets comprised of spelling errors generated by children from hand-written essays and web search inquiries, which we make available to the research community.

pdf bib abs
Learning Word Groundings from Humans Facilitated by Robot Emotional Displays
David McNeill | Casey Kennington
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

In working towards accomplishing a human-level acquisition and understanding of language, a robot must meet two requirements: the ability to learn words from interactions with its physical environment, and the ability to learn language from people in settings for language use, such as spoken dialogue. In a live interactive study, we test the hypothesis that emotional displays are a viable solution to the cold-start problem of how to communicate without relying on language the robot does not–indeed, cannot–yet know. We explain our modular system that can autonomously learn word groundings through interaction and show through a user study with 21 participants that emotional displays improve the quantity and quality of the inputs provided to the robot.

pdf bib abs
rrSDS: Towards a Robot-ready Spoken Dialogue System
Casey Kennington | Daniele Moro | Lucas Marchand | Jake Carns | David McNeill
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Spoken interaction with a physical robot requires a dialogue system that is modular, multimodal, distributive, incremental and temporally aligned. In this demo paper, we make significant contributions towards fulfilling these requirements by expanding upon the ReTiCo incremental framework. We outline the incremental and multimodal modules and how their computation can be distributed. We demonstrate the power and flexibility of our robot-ready spoken dialogue system to be integrated with almost any robot.

2018

pdf bib abs
Predicting Perceived Age: Both Language Ability and Appearance are Important
Sarah Plane | Ariel Marvasti | Tyler Egan | Casey Kennington
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

When interacting with robots in a situated spoken dialogue setting, human dialogue partners tend to assign anthropomorphic and social characteristics to those robots. In this paper, we explore the age and educational level that human dialogue partners assign to three different robotic systems, including an un-embodied spoken dialogue system. We found that how a robot speaks is as important to human perceptions as the way the robot looks. Using the data from our experiment, we derived prosodic, emotional, and linguistic features from the participants to train and evaluate a classifier that predicts perceived intelligence, age, and education level.

2016

PentoRef is a corpus of task-oriented dialogues collected in systematically manipulated settings. The corpus is multilingual, with English and German sections, and overall comprises more than 20000 utterances. The dialogues are fully transcribed and annotated with referring expressions mapped to objects in corresponding visual scenes, which makes the corpus a rich resource for research on spoken referring expressions in generation and resolution. The corpus includes several sub-corpora that correspond to different dialogue situations where parameters related to interactivity, visual access, and verbal channel have been manipulated in systematic ways. The corpus thus lends itself to very targeted studies of reference in spontaneous dialogue.

pdf bib
Resolving References to Objects in Photographs using the Words-As-Classifiers Model
David Schlangen | Sina Zarrieß | Casey Kennington
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
OpenDial: A Toolkit for Developing Spoken Dialogue Systems with Probabilistic Rules
Pierre Lison | Casey Kennington
Proceedings of ACL-2016 System Demonstrations

pdf bib
Real-Time Understanding of Complex Discriminative Scene Descriptions
Ramesh Manuvinakurike | Casey Kennington | David DeVault | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State
Casey Kennington | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Suffix trees are data structures that can be used to index a corpus. In this paper, we explore how some properties of suffix trees naturally provide the functionality of an n-gram language model with variable n. We explain these properties of suffix trees, which we leverage for our Suffix Tree Language Model (STLM) implementation and explain how a suffix tree implicitly contains the data needed for n-gram language modeling. We also discuss the kinds of smoothing techniques appropriate to such a model. We then show that our suffix-tree language model implementation is competitive when compared to the state-of-the-art language model SRILM (Stolke, 2002) in statistical machine translation experiments.

pdf bib
Markov Logic Networks for Situated Incremental Natural Language Understanding
Casey Kennington | David Schlangen
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2011

pdf bib
DFKI Hybrid Machine Translation System for WMT 2011 - On the Integration of SMT and RBMT
Jia Xu | Hans Uszkoreit | Casey Kennington | David Vilar | Xiaojun Zhang
Proceedings of the Sixth Workshop on Statistical Machine Translation

2008

pdf bib abs
Elicited Imitation as an Oral Proficiency Measure with ASR Scoring
C. Ray Graham | Deryle Lonsdale | Casey Kennington | Aaron Johnson | Jeremiah McGhee
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper discusses development and evaluation of a practical, valid and reliable instrument for evaluating the spoken language abilities of second-language (L2) learners of English. First we sketch the theory and history behind elicited imitation (EI) tests and the renewed interest in them. Then we present how we developed a new test based on various language resources, and administered it to a few hundred students of varying levels. The students were also scored using standard evaluation techniques, and the EI results were compared to more traditionally derived scores. We also sketch how we developed a new integrated tool that allows the session recordings of the EI data to be analyzed with a widely-used automatic speech recognition (ASR) engine. We discuss the promising results of the ASR engines processing of these files and how they correlated with human scoring of the same items. We indicate how the integrated tool will be used in the future. Further development plans and prospects for follow-on work round out the discussion.