Brian Roark - ACL Anthology

Brian Roark

2025

Improving Informally Romanized Language Identification
Adrian Benton | Alexander Gutkin | Christo Kirov | Brian Roark
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), the lack of conventional spelling in the Latin script results in high spelling variability. Such romanization renders languages that are normally easily distinguished due to being written in different scripts – Hindi and Urdu, for example – highly confusable. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
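
As a rough illustration of what "synthetic samples which incorporate natural spelling variation" could look like, the sketch below samples one romanization per native-script word from a lexicon of attested spellings. The lexicon entries, their counts, and the function name `synthesize_romanized` are hypothetical and not taken from the paper; the authors' actual synthesis pipeline may differ.

```python
import random

# Hypothetical toy lexicon: native-script word -> [(romanization, attestation count)].
# The spellings and counts are invented for illustration only.
TOY_LEXICON = {
    "नहीं": [("nahi", 5), ("nahin", 3), ("nhi", 2)],
    "है": [("hai", 8), ("h", 2)],
}

def synthesize_romanized(native_tokens, lexicon, rng):
    """Sample one romanization per token, proportionally to its count."""
    romanized = []
    for token in native_tokens:
        entries = lexicon.get(token)
        if not entries:
            # Assumption: tokens missing from the lexicon are passed through unchanged.
            romanized.append(token)
            continue
        spellings, weights = zip(*entries)
        romanized.append(rng.choices(spellings, weights=weights, k=1)[0])
    return " ".join(romanized)

# Repeated sampling of the same sentence yields varied spellings for training.
rng = random.Random(0)
samples = [synthesize_romanized(["नहीं", "है"], TOY_LEXICON, rng) for _ in range(5)]
```

Sampling in proportion to attestation counts is one simple design choice; it keeps common spellings frequent while still exposing a classifier to rarer variants.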

2024

Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024
Kyle Gorman | Emily Prud'hommeaux | Brian Roark | Richard Sproat
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

Abbreviation Across the World’s Languages and Scripts
Kyle Gorman | Brian Roark
Proceedings of the Second Workshop on Computation and Written Language (CAWL) @ LREC-COLING 2024

Detailed taxonomies for non-standard words, including abbreviations, have been developed for speech and language processing, though mostly with reference to English. In this paper, we examine abbreviation formation strategies in a diverse sample of more than 50 languages, dialects and scripts. The resulting taxonomy, along with data about which strategies are attested in which languages, provides key information needed to create multilingual systems for abbreviation expansion, an essential component for speech processing and text understanding.

Context-aware Transliteration of Romanized South Asian Languages
Christo Kirov | Cibu Johny | Anna Katanova | Alexander Gutkin | Brian Roark
Computational Linguistics, Volume 50, Issue 2 - June 2024

While most transliteration research is focused on single tokens such as named entities (for example, transliteration of a city name from the Gujarati script to the Latin script as “Ahmedabad”, the most populous city in the Indian state of Gujarat), the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full sentence (as opposed to single word) transliterations necessitates incorporation of contextual information into transliteration via non-parallel resources, such as mono-script text collections. In this article, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models fine-tuned on simulated parallel data, yield substantial improvements over the best previously reported results for full sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction.
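
The relationship between the two reported reductions can be checked with a little arithmetic. The sketch below assumes that the relative reduction is simply the absolute reduction divided by a single mean baseline word-error rate; if the relative figure was instead averaged per language, the implied numbers are only approximate. The function name is hypothetical.

```python
def implied_baseline_wer(absolute_reduction, relative_reduction):
    """Baseline WER (in %) implied by: absolute = relative * baseline."""
    return absolute_reduction / relative_reduction

ABS_REDUCTION = 3.3    # percentage points, as reported in the abstract
REL_REDUCTION = 0.186  # 18.6% relative, as reported in the abstract

baseline = implied_baseline_wer(ABS_REDUCTION, REL_REDUCTION)
improved = baseline - ABS_REDUCTION
print(f"implied baseline WER ~ {baseline:.1f}%, implied new WER ~ {improved:.1f}%")
```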

2023

Proceedings of the Workshop on Computation and Written Language (CAWL 2023)
Kyle Gorman | Richard Sproat | Brian Roark
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

Distinguishing Romanized Hindi from Romanized Urdu
Elizabeth Nielsen | Christo Kirov | Brian Roark
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

We examine the task of distinguishing between Hindi and Urdu when those languages are romanized, i.e., written in the Latin script. Both languages are widely informally romanized, and to the extent that they are identified in the Latin script by language identification systems, they are typically conflated. In the absence of large labeled collections of such text, we consider methods for generating training data. Beginning with a small set of seed words, each of which is strongly indicative of one of the languages versus the other, we prompt a pretrained large language model (LLM) to generate romanized text. Treating text generated from an Urdu prompt as one class and text generated from a Hindi prompt as the other class, we build a binary language identification (LangID) classifier. We demonstrate that the resulting classifier distinguishes manually romanized Urdu Wikipedia text from manually romanized Hindi Wikipedia text far better than chance. We use this classifier to estimate the prevalence of Urdu in a large collection of text labeled as romanized Hindi that has been used to train large language models. These techniques can be applied to bootstrap classifiers in other cases where a dataset is known to contain multiple distinct but related classes, such as different dialects of the same language, but for which labels cannot easily be obtained.

Spelling convention sensitivity in neural language models
Elizabeth Nielsen | Christo Kirov | Brian Roark
Findings of the Association for Computational Linguistics: EACL 2023

We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistently one or the other within model-generated strings. In contrast to long-distance dependencies in non-surface underlying structure (e.g., syntax), spelling consistency is easier to measure both in LMs and the text corpora used to train them, which can provide additional insight into certain observed model behaviors. Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency. A large T5 language model does appear to internalize this consistency, though only with respect to observed lexical items (not nonce words with British/American spelling patterns). We further experiment with correcting for biases in the training data by fine-tuning T5 on synthetic data that has been debiased, and find that fine-tuned T5 remains only somewhat sensitive to spelling consistency. Further experiments show GPT2 to be similarly limited.
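
One simple way to quantify spelling consistency in a corpus, given probe words, is sketched below. The probe list, the two-probe eligibility threshold, and the function names are assumptions made for illustration; the paper's own probe set and consistency measure may be defined differently.

```python
import re
from collections import Counter

# Hypothetical probe words, each tagged with the convention it signals.
PROBES = {
    "colour": "UK", "color": "US",
    "organise": "UK", "organize": "US",
    "centre": "UK", "center": "US",
}

def spelling_profile(text):
    """Count probe tokens per convention in one text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(PROBES[t] for t in tokens if t in PROBES)

def consistency(texts):
    """Fraction of texts with at least two probe tokens that use a single convention."""
    eligible = 0
    consistent = 0
    for text in texts:
        profile = spelling_profile(text)
        if sum(profile.values()) < 2:
            continue  # too few probes to say anything about consistency
        eligible += 1
        if len(profile) == 1:
            consistent += 1
    return consistent / eligible if eligible else float("nan")
```

The same function can be applied to training documents and to model-generated strings, which makes the two directly comparable.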

Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) — languages for which NLP research is particularly far behind in meeting user needs — it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks — tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.

2022

Extensions to Brahmic script processing within the Nisaba library: new scripts, languages and utilities
Alexander Gutkin | Cibu Johny | Raiomond Doctor | Lawrence Wolf-Sonkin | Brian Roark
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Brahmic family of scripts is used to record some of the most spoken languages in the world and is arguably the most diverse family of writing systems. In this work, we present several substantial extensions to Brahmic script functionality within the open-source Nisaba library of finite-state script normalization and processing utilities (Johny et al., 2021). First, we extend coverage from the original ten scripts to an additional ten scripts of South Asia and beyond, including some used to record endangered languages such as Dogri. Second, we augment the language layer so that scripts used by multiple languages in distinct ways can be processed correctly for more languages, such as the Bengali script when used for the low-resource language Santali. We document key changes to the finite-state engine required to support these new languages and scripts. Finally, we add new script processing utilities, including lightweight script-level reading normalization that (unlike existing visual normalization) does not preserve visual invariance, and a fixed-input transliteration mechanism specifically tailored to Brahmic text entry with ASCII characters.

Criteria for Useful Automatic Romanization in South Asian Languages
Isin Demirsahin | Cibu Johny | Alexander Gutkin | Brian Roark
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents a number of possible criteria for systems that transliterate South Asian languages from their native scripts into the Latin script, a process known as romanization. These criteria are related to either fidelity to human linguistic behavior (pronunciation transparency, naturalness and conventionality) or processing utility for people (ease of input) as well as under-the-hood in systems (invertibility and stability across languages and scripts). When addressing these differing criteria, several linguistic considerations, such as modeling of prominent phonological processes and their relation to orthography, need to be taken into account. We discuss these key linguistic details in the context of Brahmic scripts and languages that use them, such as Hindi and Malayalam. We then present the core features of several romanization algorithms, implemented in a finite state transducer (FST) formalism, that address differing criteria. Implementations of these algorithms have been released as part of the Nisaba finite-state script processing library.

Design principles of an open-source language modeling microservice package for AAC text-entry applications
Brian Roark | Alexander Gutkin
Ninth Workshop on Speech and Language Processing for Assistive Technologies (SLPAT-2022)

We present MozoLM, an open-source language model microservice package intended for use in AAC text-entry applications, with a particular focus on the design principles of the library. The intent of the library is to allow the ensembling of multiple diverse language models without requiring the clients (user interface designers, system users or speech-language pathologists) to attend to the formats of the models. Issues around privacy, security, dynamic versus static models, and methods of model combination are explored and specific design choices motivated. Some simulation experiments demonstrating the benefits of personalized language model ensembling via the library are presented.

Beyond Arabic: Software for Perso-Arabic Script Manipulation
Alexander Gutkin | Cibu Johny | Raiomond Doctor | Brian Roark | Richard Sproat
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for eleven contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.

2021

Approximating Probabilistic Models as Weighted Finite Automata
Ananda Theertha Suresh | Brian Roark | Michael Riley | Vlad Schogol
Computational Linguistics, Volume 47, Issue 2 - June 2021

Weighted finite automata (WFAs) are often used to represent probabilistic models, such as n-gram language models, because, among other things, they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a WFA such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library.

Disambiguatory Signals are Stronger in Word-initial Positions
Tiago Pimentel | Ryan Cotterell | Brian Roark
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower). This has led to the conjecture—as in Wedel et al. (2019b), but common elsewhere—that languages have evolved to provide more information earlier in words than later. Information-theoretic methods to establish such tendencies in lexicons have suffered from several methodological shortcomings that leave open the question of whether this high word-initial informativeness is actually a property of the lexicon or simply an artefact of the incremental nature of recognition. In this paper, we point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word, and present several new measures that avoid these confounds. When controlling for these confounds, we still find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.

Finite-state script normalization and processing utilities: The Nisaba Brahmic library
Cibu Johny | Lawrence Wolf-Sonkin | Alexander Gutkin | Brian Roark
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This paper presents an open-source library for efficient low-level processing of ten major South Asian Brahmic scripts. The library provides a flexible and extensible framework for supporting crucial operations on Brahmic scripts, such as NFC, visual normalization, reversible transliteration, and validity checks, implemented in Python within a finite-state transducer formalism. We survey some common Brahmic script issues that may adversely affect the performance of downstream NLP tasks, and provide the rationale for finite-state design and system implementation details.

Structured abbreviation expansion in context
Kyle Gorman | Christo Kirov | Brian Roark | Richard Sproat
Findings of the Association for Computational Linguistics: EMNLP 2021

Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The problem is related to, but distinct from, spelling correction, as ad hoc abbreviations are intentional and can involve more substantial differences from the original words. Ad hoc abbreviations are also productively generated on-the-fly, so they cannot be resolved solely by dictionary lookup. We generate a large, open-source data set of ad hoc abbreviations. This data is used to study abbreviation strategies and to develop two strong baselines for abbreviation expansion.

Finding Concept-specific Biases in Form–Meaning Associations
Tiago Pimentel | Brian Roark | Søren Wichmann | Ryan Cotterell | Damián Blasi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has been claimed (Blasi et al., 2016) that the word for “tongue” is more likely than chance to contain the phone [l]. By controlling for the influence of language family and geographic proximity within a very large concept-aligned, cross-lingual lexicon, we extend methods previously used to detect within-language non-arbitrariness (Pimentel et al., 2019) to measure cross-linguistic associations. We find that there is a significant effect of non-arbitrariness, but it is unsurprisingly small (less than 0.5% on average according to our information-theoretic estimate). We also provide a concept-level analysis which shows that a quarter of the concepts considered in our work exhibit a significant level of cross-linguistic non-arbitrariness. In sum, the paper provides new methods to detect cross-linguistic associations at scale, and confirms their effects are minor.

2020

Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Brian Roark | Lawrence Wolf-Sonkin | Christo Kirov | Sabrina J. Mielke | Cibu Johny | Isin Demirsahin | Keith Hall
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

Transactions of the Association for Computational Linguistics, Volume 8
Mark Johnson | Brian Roark | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 8

Phonotactic Complexity and Its Trade-offs
Tiago Pimentel | Brian Roark | Ryan Cotterell
Transactions of the Association for Computational Linguistics, Volume 8

We present methods for calculating a measure of phonotactic complexity, bits per phoneme, that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the International Phonetic Alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language’s phonotactics is. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of −0.74 between bits per phoneme and the average length of words.
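
The bits-per-phoneme measure can be sketched as the negative base-2 log-probability of a word divided by its number of predicted symbols. The example below uses an add-alpha smoothed bigram model as a stand-in for the statistical model; this model choice, the decision to count the end-of-word symbol in the length, and all names are assumptions for illustration rather than the paper's setup.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram(words, alpha=1.0):
    """Fit an add-alpha smoothed bigram model; returns a log2-probability function."""
    vocab = {EOS}
    bigrams, contexts = Counter(), Counter()
    for word in words:
        vocab.update(word)
        seq = [BOS] + list(word) + [EOS]
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    vocab_size = len(vocab)

    def log2_prob(prev, cur):
        numerator = bigrams[(prev, cur)] + alpha
        denominator = contexts[prev] + alpha * vocab_size
        return math.log2(numerator / denominator)

    return log2_prob

def bits_per_phoneme(word, log2_prob):
    """Negative log2-probability of the word, averaged over its symbols plus EOS."""
    seq = [BOS] + list(word) + [EOS]
    total_bits = -sum(log2_prob(prev, cur) for prev, cur in zip(seq, seq[1:]))
    return total_bits / (len(word) + 1)

# Toy lexicon of phoneme tuples (invented, not from any real language sample).
lexicon = [("k", "a", "t"), ("k", "a", "p"), ("t", "a", "k")]
model = train_bigram(lexicon)
seen = bits_per_phoneme(("k", "a", "t"), model)
odd = bits_per_phoneme(("p", "p", "p"), model)
```

A phonotactically typical word such as the first example should cost fewer bits per phoneme than an atypical one such as the second.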

2019

Neural Models of Text Normalization for Speech Applications
Hao Zhang | Richard Sproat | Axel H. Ng | Felix Stahlberg | Xiaochang Peng | Kyle Gorman | Brian Roark
Computational Linguistics, Volume 45, Issue 2 - June 2019

Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that “123” is verbalized as “one hundred twenty three” in “123 pages” but as “one twenty three” in “123 King Ave.” For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars. We propose neural network models that treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model, in accuracy and efficiency, is one where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process. These models perform very well overall, but occasionally they will predict wildly inappropriate verbalizations, such as reading “3 cm” as “three kilometers”. Although rare, such verbalizations are a major issue for TTS applications. We thus use finite-state covering grammars to guide the neural models, either during training and decoding, or just during decoding, away from such “unrecoverable” errors. Such grammars can largely be learned from data.
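
The idea of steering a model away from “unrecoverable” errors at decoding time can be illustrated with a much simpler stand-in for a covering grammar: a lookup table of permitted unit words used to filter ranked candidates. The table, the function names, and the candidate list are hypothetical; the paper uses finite-state covering grammars, not this set-based filter.

```python
import re

# Hypothetical unit table standing in for a finite-state covering grammar.
ALLOWED_UNIT_WORDS = {
    "cm": {"centimeter", "centimeters"},
    "km": {"kilometer", "kilometers"},
    "kg": {"kilogram", "kilograms"},
}

def is_covered(token, verbalization):
    """True if the verbalization is permitted for the token, or the token is unconstrained."""
    match = re.fullmatch(r"\d+\s*([a-z]+)", token)
    if match is None or match.group(1) not in ALLOWED_UNIT_WORDS:
        return True  # this toy grammar says nothing about such tokens
    words = verbalization.split()
    return bool(words) and words[-1] in ALLOWED_UNIT_WORDS[match.group(1)]

def pick_verbalization(token, ranked_candidates):
    """Return the highest-ranked candidate that the toy grammar permits."""
    for candidate in ranked_candidates:
        if is_covered(token, candidate):
            return candidate
    # Assumption: fall back to the model's top choice when nothing is covered.
    return ranked_candidates[0] if ranked_candidates else None

# Candidates as a model might rank them, with the top one being an unrecoverable error.
choice = pick_verbalization("3 cm", ["three kilometers", "three centimeters"])
```

The filter only rejects candidates; ranking among permitted verbalizations is still left to the model.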

Meaning to Form: Measuring Systematicity as Information
Tiago Pimentel | Arya D. McCarthy | Damian Blasi | Brian Roark | Ryan Cotterell
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram ‘gl’ have any systematic relationship to the meaning of words like ‘glisten’, ‘gleam’ and ‘glow’? In this work, we offer a holistic quantification of the systematicity of the sign using mutual information and recurrent neural networks. We employ these in a data-driven and massively multilingual approach to the question, examining 106 languages. We find a statistically significant reduction in entropy when modeling a word form conditioned on its semantic representation. Encouragingly, we also recover well-attested English examples of systematic affixes. We conclude with the meta-point: Our approximate effect size (measured in bits) is quite small—despite some amount of systematicity between form and meaning, an arbitrary relationship and its resulting benefits dominate human language.

What Kind of Language Is Hard to Language-Model?
Sabrina J. Mielke | Ryan Cotterell | Kyle Gorman | Brian Roark | Jason Eisner
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that “translationese” is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

Transactions of the Association for Computational Linguistics, Volume 7
Lillian Lee | Mark Johnson | Brian Roark | Ani Nenkova
Transactions of the Association for Computational Linguistics, Volume 7

Distilling weighted finite automata from arbitrary probabilistic models
Ananda Theertha Suresh | Brian Roark | Michael Riley | Vlad Schogol
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

Weighted finite automata (WFA) are often used to represent probabilistic models, such as n-gram language models, since they are efficient for recognition tasks in time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a weighted finite automaton such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization, both of which can be performed efficiently. We demonstrate the usefulness of our approach on some tasks including distilling n-gram models from neural models.

Latin script keyboards for South Asian languages with finite-state normalization
Lawrence Wolf-Sonkin | Vlad Schogol | Brian Roark | Michael Riley
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script. We explore several compact finite-state architectures that permit variable spellings of words during mobile text entry. We find that approaches making use of transliteration transducers provide large accuracy improvements over baselines, but that simpler approaches involving a compact representation of many attested alternatives yield much of the accuracy gain. This is particularly important when operating under constraints on model size (e.g., on inexpensive mobile devices with limited storage and memory for keyboard models), and on speed of inference, since people typing on mobile keyboards expect no perceptual delay in keyboard responsiveness.

Rethinking Phonotactic Complexity
Tiago Pimentel | Brian Roark | Ryan Cotterell
Proceedings of the 2019 Workshop on Widening NLP

In this work, we propose the use of phone-level language models to estimate phonotactic complexity, measured in bits per phoneme, which makes cross-linguistic comparison straightforward. We compare the entropy across languages using this simple measure, gaining insight into how complex different languages’ phonotactics are. Finally, we show a very strong negative correlation between phonotactic complexity and the average length of words (Spearman rho = −0.744) when analysing a collection of 106 languages with 1016 basic concepts each.

2018

Are All Languages Equally Hard to Language-Model?
Ryan Cotterell | Sabrina J. Mielke | Jason Eisner | Brian Roark
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

Transactions of the Association for Computational Linguistics, Volume 6
Lillian Lee | Mark Johnson | Kristina Toutanova | Brian Roark
Transactions of the Association for Computational Linguistics, Volume 6

2017

Transliterated Mobile Keyboard Input via Weighted Finite-State Transducers
Lars Hellsten | Brian Roark | Prasoon Goyal | Cyril Allauzen | Françoise Beaufays | Tom Ouyang | Michael Riley | David Rybach
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

2016

Distributed representation and estimation of WFST-based n-gram models
Cyril Allauzen | Michael Riley | Brian Roark
Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata

2015

Graph-Based Word Alignment for Clinical Language Evaluation
Emily Prud’hommeaux | Brian Roark
Computational Linguistics, Volume 41, Issue 4 - December 2015

2014

Data Driven Grammatical Error Detection in Transcripts of Children’s Speech
Eric Morley | Anna Eva Hallin | Brian Roark
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Applications of Lexicographic Semirings to Problems in Speech and Language Processing
Richard Sproat | Mahsa Yarmohammadi | Izhak Shafran | Brian Roark
Computational Linguistics, Volume 40, Issue 4 - December 2014

Hippocratic Abbreviation Expansion
Brian Roark | Richard Sproat
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Transforming trees into hedges and parsing with “hedgebank” grammars
Mahsa Yarmohammadi | Aaron Dunlop | Brian Roark
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Challenges in Automating Maze Detection
Eric Morley | Anna Eva Hallin | Brian Roark
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

2013

Pair Language Models for Deriving Alternative Pronunciations and Spellings from Pronunciation Dictionaries
Russell Beckley | Brian Roark
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Discriminative Joint Modeling of Lexical Variation and Acoustic Confusion for Automated Narrative Retelling Assessment
Maider Lehr | Izhak Shafran | Emily Prud’hommeaux | Brian Roark
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Distributional semantic models for the evaluation of disordered language
Masoud Rouhizadeh | Emily Prud’hommeaux | Brian Roark | Jan van Santen
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Smoothed marginal distribution constraints for language modeling
Brian Roark | Cyril Allauzen | Michael Riley
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The Utility of Manual and Automatic Linguistic Error Codes for Identifying Neurodevelopmental Disorders
Eric Morley | Brian Roark | Jan van Santen
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Peter Ljunglöf | Kathleen F. McCoy | François Portet | Brian Roark | Frank Rudzicz | Michel Vacher
Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies

2012

Finite-State Chart Constraints for Reduced Complexity Context-Free Parsing Pipelines
Brian Roark | Kristy Hollingshead | Nathan Bodenstab
Computational Linguistics, Volume 38, Issue 4 - December 2012

The OpenGrm open-source finite-state grammar software libraries
Brian Roark | Richard Sproat | Cyril Allauzen | Michael Riley | Jeffrey Sorensen | Terry Tai
Proceedings of the ACL 2012 System Demonstrations

Robust kaomoji detection in Twitter
Steven Bedrick | Russell Beckley | Brian Roark | Richard Sproat
Proceedings of the Second Workshop on Language in Social Media

Graph-based alignment of narratives for automated neurological assessment
Emily Prud’hommeaux | Brian Roark
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies
Jan Alexandersson | Peter Ljunglöf | Kathleen F. McCoy | Brian Roark | Annalu Waller
Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies

2011

Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
Zhifei Li | Ziyuan Wang | Jason Eisner | Sanjeev Khudanpur | Brian Roark
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Beam-Width Prediction for Efficient Context-Free Parsing
Nathan Bodenstab | Aaron Dunlop | Keith Hall | Brian Roark
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Lexicographic Semirings for Exact Automata Encoding of Sequence Models
Brian Roark | Richard Sproat | Izhak Shafran
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Semi-Supervised Modeling for Prenominal Modifier Ordering
Margaret Mitchell | Aaron Dunlop | Brian Roark
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Unary Constraints for Efficient Context-Free Parsing
Nathan Bodenstab | Kristy Hollingshead | Brian Roark
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
Kenneth Hild | Umut Orhan | Deniz Erdogmus | Brian Roark | Barry Oken | Shalini Purwar | Hooman Nezamfar | Melanie Fried-Oken
Proceedings of the ACL-HLT 2011 System Demonstrations

Classification of Atypical Language in Autism
Emily T. Prud’hommeaux | Brian Roark | Lois M. Black | Jan van Santen
Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics

Towards technology-assisted co-construction with communication partners
Brian Roark | Andrew Fowler | Richard Sproat | Christopher Gibbons | Melanie Fried-Oken
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

Asynchronous fixed-grid scanning with dynamic codes
Russ Beckley | Brian Roark
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

Efficient Matrix-Encoded Grammars and Low Latency Parallelization Strategies for CYK
Aaron Dunlop | Nathan Bodenstab | Brian Roark
Proceedings of the 12th International Conference on Parsing Technologies

2010

Prenominal Modifier Ordering via Multiple Sequence Alignment
Aaron Dunlop | Margaret Mitchell | Brian Roark
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies
Melanie Fried-Oken | Kathleen F. McCoy | Brian Roark
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

Scanning methods and language modeling for binary switch typing
Brian Roark | Jacques de Villiers | Christopher Gibbons | Melanie Fried-Oken
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

Demo Session Abstracts
Brian Roark
Proceedings of the NAACL HLT 2010 Workshop on Speech and Language Processing for Assistive Technologies

2009

Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing
Brian Roark | Asaf Bachrach | Carlos Cardenas | Christophe Pallier
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Linear Complexity Context-Free Parsing Pipelines via Chart Constraints
Brian Roark | Kristy Hollingshead
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts
Ciprian Chelba | Paul Kantor | Brian Roark
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

Proceedings of the ACL-IJCNLP 2009 Student Research Workshop
Brian Roark | Grace Ngai | Davis Muhajereen D. Dimalen | Jenny Rose Finkel | Blaise Thomson
Proceedings of the ACL-IJCNLP 2009 Student Research Workshop

2008

Classifying Chart Cells for Quadratic Complexity Context-Free Inference
Brian Roark | Kristy Hollingshead
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

Book Reviews: Putting Linguistics into Speech Recognition: The Regulus Grammar Compiler, by Manny Rayner, Beth Ann Hockey, and Pierette Bouillon
Brian Roark
Computational Linguistics, Volume 33, Number 2, June 2007

Pipeline Iteration
Kristy Hollingshead | Brian Roark
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

The utility of parse-derived features for automatic discourse segmentation
Seeger Fisher | Brian Roark
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

Syntactic complexity measures for detecting Mild Cognitive Impairment
Brian Roark | Margaret Mitchell | Kristy Hollingshead
Biological, translational, and clinical language processing

2006

While both spoken and written language processing stand to benefit from parsing, the standard Parseval metrics (Black et al., 1991) and their canonical implementation (Sekine and Collins, 1997) are only useful for text. The Parseval metrics are undefined when the words input to the parser do not match the words in the gold standard parse tree exactly, and word errors are unavoidable with automatic speech recognition (ASR) systems. To fill this gap, we have developed a publicly available tool for scoring parses that implements a variety of metrics which can handle mismatches in words and segmentations, including: alignment-based bracket evaluation, alignment-based dependency evaluation, and a dependency evaluation that does not require alignment. We describe the different metrics, how to use the tool, and the outcome of an extensive set of experiments on the sensitivity.

Probabilistic Context-Free Grammar Induction Based on Structural Zeros
Mehryar Mohri | Brian Roark
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

PCFGs with Syntactic and Prosodic Indicators of Speech Repairs
John Hale | Izhak Shafran | Lisa Yung | Bonnie J. Dorr | Mary Harper | Anna Krasnyanskaya | Matthew Lease | Yang Liu | Brian Roark | Matthew Snover | Robin Stewart
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

Comparing and Combining Finite-State and Context-Free Parsers
Kristy Hollingshead | Seeger Fisher | Brian Roark
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Discriminative Syntactic Language Modeling for Speech Recognition
Michael Collins | Brian Roark | Murat Saraclar
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Language Model Adaptation with MAP Estimation and the Perceptron Algorithm
Michiel Bacchiani | Brian Roark | Murat Saraclar
Proceedings of HLT-NAACL 2004: Short Papers

Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm
Brian Roark | Murat Saraclar | Michael Collins | Mark Johnson
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

Incremental Parsing with the Perceptron Algorithm
Michael Collins | Brian Roark
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

Efficient Incremental Beam-Search Parsing with Generative and Discriminative Models
Brian Roark
Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together

2003

Supervised and unsupervised PCFG adaptation to novel domains
Brian Roark | Michiel Bacchiani
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

Generalized Algorithms for Constructing Statistical Language Models
Cyril Allauzen | Mehryar Mohri | Brian Roark
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

2002

Markov Parsing: Lattice Rescoring with a Statistical Parser
Brian Roark
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

2001

Probabilistic Top-Down Parsing and Language Modeling
Brian Roark
Computational Linguistics, Volume 27, Number 2, June 2001

2000

Compact non-left-recursive grammars using the selective left-corner transform and factoring
Mark Johnson | Brian Roark
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

Measuring Efficiency in High-accuracy, Broad-coverage Statistical Parsing
Brian Roark | Eugene Charniak
Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems

1999

Efficient probabilistic top-down and left-corner parsing
Brian Roark | Mark Johnson
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction
Brian Roark | Eugene Charniak
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

Noun-Phrase Co-occurrence Statistics for Semi-Automatic Semantic Lexicon Construction
Brian Roark | Eugene Charniak
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

Co-authors

Christo Kirov 7

Michael Riley 7

Emily Prud’hommeaux 6

Cyril Allauzen 5

Tiago Pimentel 5

Izhak Shafran 5

Nathan Bodenstab 4

Eugene Charniak 4

Melanie Fried-Oken 4

Lawrence Wolf-Sonkin 4

Russell Beckley 3

Michael Collins 3

Kathleen F. McCoy 3

Sabrina J. Mielke 3

Margaret Mitchell 3

Murat Saraclar 3

Jan van Santen 3

Jan Alexandersson 2

Michiel Bacchiani 2

Isin Demirsahin 2

Raiomond Doctor 2

Seeger Fisher 2

Christopher Gibbons 2

Anna Eva Hallin 2

Anna Katanova 2

Anna Krasnyanskaya 2

Matthew Lease 2

Yang Liu (刘扬) 2

Peter Ljunglöf 2

Mehryar Mohri 2

Elizabeth Nielsen 2

Matthew Snover 2

Robin Stewart 2

Ananda Theertha Suresh 2

Mahsa Yarmohammadi 2

David Ifeoluwa Adelani 1

Asaf Bachrach 1

Françoise Beaufays 1

Steven Bedrick 1

Adrian Benton 1

Lois M. Black 1

Carlos Cardenas 1

Isaac Caswell 1

Ciprian Chelba 1

Jonathan H. Clark 1

Dana L. Dickinson 1

Davis Muhajereen D. Dimalen 1

Deniz Erdogmus 1

Jenny Rose Finkel 1

Andrew Fowler 1

Prasoon Goyal 1

Lars Hellsten 1

Melvin Johnson 1

Jeremy G. Kahn 1

Sanjeev Khudanpur 1

Arya D. McCarthy 1

Hooman Nezamfar 1

Massimo Nicosia 1

Mari Ostendorf 1

Christophe Pallier 1

Dmitry Panteleev 1

Xiaochang Peng 1

François Portet 1

Shalini Purwar 1

Shruti Rijhwani 1

Masoud Rouhizadeh 1

Sebastian Ruder 1

Frank Rudzicz 1

Bidisha Samanta 1

Jean-Michel A. Sarr 1

Jeffrey Sorensen 1

Felix Stahlberg 1

Partha Talukdar 1

Blaise Thomson 1

Kristina Toutanova 1

Michel Vacher 1

Annalu Waller 1

Søren Wichmann 1

Jacques de Villiers 1

Venues