Richard Sproat

Also published as: Richard W. Sproat


2021

The Taxonomy of Writing Systems: How to Measure How Logographic a System Is
Richard Sproat | Alexander Gutkin
Computational Linguistics, Volume 47, Issue 3 - November 2021

Taxonomies of writing systems since Gelb (1952) have classified systems based on what the written symbols represent: if they represent words or morphemes, they are logographic; if syllables, syllabic; if segments, alphabetic; and so forth. Sproat (2000) and Rogers (2005) broke with tradition by splitting the logographic and phonographic aspects into two dimensions, with logography being graded rather than a categorical distinction. A system could be syllabic, and highly logographic; or alphabetic, and mostly non-logographic. This accords better with how writing systems actually work, but neither author proposed a method for measuring logography. In this article we propose a novel measure of the degree of logography that uses an attention-based sequence-to-sequence model trained to predict the spelling of a token from its pronunciation in context. In an ideal phonographic system, the model should need to attend to only the current token in order to compute how to spell it, and this would show in the attention matrix activations. In contrast, with a logographic system, where a given pronunciation might correspond to several different spellings, the model would need to attend to a broader context. The ratio of the activation outside the token and the total activation forms the basis of our measure. We compare this with a simple lexical measure, and an entropic measure, as well as several other neural models, and argue that on balance our attention-based measure accords best with intuition about how logographic various systems are. Our work provides the first quantifiable measure of the notion of logography that accords with linguistic intuition and, we argue, provides better insight into what this notion means.
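The ratio described in the abstract can be illustrated with a small sketch. The abstract says only that the ratio of out-of-token activation to total activation "forms the basis" of the measure, so the function name, the list-of-rows attention layout, and the half-open token span below are assumptions made for illustration, not the authors' implementation.

```python
def out_of_token_attention_ratio(attention, token_span):
    """Share of attention mass that falls outside the current token.

    attention: list of rows, one per output step; each row holds the
        attention weights over input positions (hypothetical layout).
    token_span: (start, end) half-open range of input positions that
        belong to the token being spelled (hypothetical convention).
    """
    start, end = token_span
    total = 0.0
    outside = 0.0
    for row in attention:
        for pos, weight in enumerate(row):
            total += weight
            if not (start <= pos < end):
                outside += weight
    if total == 0.0:
        raise ValueError("attention matrix has no activation")
    return outside / total


# Toy attention matrix: two output steps over three input positions,
# with the current token occupying input position 1 only.
attention = [
    [0.1, 0.8, 0.1],
    [0.0, 0.5, 0.5],
]
ratio = out_of_token_attention_ratio(attention, (1, 2))
```

Under this reading, a value near 0 means the model attends almost entirely within the token (more phonographic), and a larger value means it relies on surrounding context (more logographic).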

Structured abbreviation expansion in context
Kyle Gorman | Christo Kirov | Brian Roark | Richard Sproat
Findings of the Association for Computational Linguistics: EMNLP 2021

Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The problem is related to, but distinct from, spelling correction, as ad hoc abbreviations are intentional and can involve more substantial differences from the original words. Ad hoc abbreviations are also productively generated on-the-fly, so they cannot be resolved solely by dictionary lookup. We generate a large, open-source data set of ad hoc abbreviations. This data is used to study abbreviation strategies and to develop two strong baselines for abbreviation expansion.

2020

Semi-supervised URL Segmentation with Recurrent Neural Networks Pre-trained on Knowledge Graph Entities
Hao Zhang | Jae Ro | Richard Sproat
Proceedings of the 28th International Conference on Computational Linguistics

Breaking domain names such as openresearch into component words open and research is important for applications like Text-to-Speech synthesis and web search. We link this problem to the classic problem of Chinese word segmentation and show the effectiveness of a tagging model based on Recurrent Neural Networks (RNNs) using characters as input. To compensate for the lack of training data, we propose a pre-training method on concatenated entity names in a large knowledge database. Pre-training improves the model by 33% and brings the sequence accuracy to 85%.
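Casting segmentation as character tagging can be sketched as follows. The abstract does not state the tag inventory, so the two-tag scheme here ("B" for a character that begins a word, "I" for one that continues it) and the function name are assumptions; the RNN that would predict the tags is not shown.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character tags.

    chars: the unsegmented string, e.g. a domain name.
    tags: one tag per character; "B" starts a new word, "I" continues
        the current one (hypothetical tag set).
    """
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            # Start a new word (also covers a sequence that begins with "I").
            words.append(ch)
        else:
            words[-1] += ch
    return words


# The abstract's example, with tags a tagger would ideally predict.
words = tags_to_words("openresearch", "BIIIBIIIIIII")
```

Framing the task this way means any sequence tagger over characters can be used, which is what links it to Chinese word segmentation.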

NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task
Alexander Gutkin | Richard Sproat
Proceedings of the Second Workshop on Computational Research in Linguistic Typology

This paper describes the NEMO submission to the SIGTYP 2020 shared task (Bjerva et al., 2020), which deals with the prediction of linguistic typological features for multiple languages using data derived from the World Atlas of Language Structures (WALS). We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features. We describe two submitted ridge regression-based configurations, which ranked second and third overall in the constrained task. Our best configuration achieved a micro-averaged accuracy score of 0.66 on 149 test languages.
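For readers unfamiliar with the metric, micro-averaged accuracy pools every individual feature prediction across languages before dividing, so languages with more test features weigh more. The sketch below assumes that standard definition; the shared task's exact scoring script is not described in the abstract, and the data layout, language names, and feature values are made up for illustration.

```python
def micro_averaged_accuracy(predictions_by_language):
    """Accuracy over all (predicted, gold) pairs pooled across languages.

    predictions_by_language: dict mapping a language name to a list of
        (predicted_value, gold_value) pairs (hypothetical layout).
    """
    correct = 0
    total = 0
    for pairs in predictions_by_language.values():
        for predicted, gold in pairs:
            total += 1
            if predicted == gold:
                correct += 1
    if total == 0:
        raise ValueError("no predictions to score")
    return correct / total


# Invented toy data: one language with two features, one with three.
toy = {
    "lang_a": [("SOV", "SOV"), ("prefixing", "suffixing")],
    "lang_b": [("SVO", "SVO"), ("tonal", "tonal"), ("none", "dual")],
}
score = micro_averaged_accuracy(toy)
```

Here three of the five pooled predictions are correct, whereas averaging the two per-language accuracies instead (a macro average) would give a different figure.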

2019

Neural Models of Text Normalization for Speech Applications
Hao Zhang | Richard Sproat | Axel H. Ng | Felix Stahlberg | Xiaochang Peng | Kyle Gorman | Brian Roark
Computational Linguistics, Volume 45, Issue 2 - June 2019

Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that 123 is verbalized as one hundred twenty three in 123 pages but as one twenty three in 123 King Ave. For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars. We propose neural network models that treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context, and the output is the verbalization of that token. We find that the most effective model, in accuracy and efficiency, is one where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in terms of representing the context, and also allows us to integrate tagging and segmentation into the process. These models perform very well overall, but occasionally they will predict wildly inappropriate verbalizations, such as reading 3 cm as three kilometers. Although rare, such verbalizations are a major issue for TTS applications. We thus use finite-state covering grammars to guide the neural models, either during training and decoding, or just during decoding, away from such “unrecoverable” errors. Such grammars can largely be learned from data.
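The 123 example can be made concrete with a toy, hand-written verbalizer. This is only an illustration of the ambiguity the abstract describes, not the neural model or the covering grammars from the paper; the function names, the restriction to 0-999, and the "oh" reading for addresses such as 105 are assumptions.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def under_100(n):
    """Verbalize 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]


def as_cardinal(n):
    """Verbalize 0..999 as a quantity, e.g. in '123 pages'."""
    if n < 100:
        return under_100(n)
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + under_100(rest)


def as_address(n):
    """Verbalize 0..999 as a street number, e.g. in '123 King Ave.'"""
    if n < 100:
        return under_100(n)
    hundreds, rest = divmod(n, 100)
    if rest == 0:
        return ONES[hundreds] + " hundred"
    if rest < 10:
        # Assumed reading for numbers like 105: "one oh five".
        return ONES[hundreds] + " oh " + ONES[rest]
    return ONES[hundreds] + " " + under_100(rest)


pages_reading = as_cardinal(123)
street_reading = as_address(123)
```

The hard part, which this sketch sidesteps, is choosing between the two readings from the surrounding words; that contextual decision is what the sequence-to-sequence models in the abstract are meant to learn.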

2018

Fast and Accurate Reordering with ITG Transition RNN
Hao Zhang | Axel Ng | Richard Sproat
Proceedings of the 27th International Conference on Computational Linguistics

Attention-based sequence-to-sequence neural network models learn to jointly align and translate. The quadratic-time attention mechanism is powerful as it is capable of handling arbitrary long-distance reordering, but computationally expensive. In this paper, towards making neural translation both accurate and efficient, we follow the traditional pre-reordering approach to decouple reordering from translation. We add a reordering RNN that shares the input encoder with the decoder. The RNNs are trained jointly with a multi-task loss function and applied sequentially at inference time. The task of the reordering model is to predict the permutation of the input words following the target language word order. After reordering, the attention in the decoder becomes more peaked and monotonic. For reordering, we adopt the Inversion Transduction Grammars (ITG) and propose a transition system to parse input to trees for reordering. We harness the ITG transition system with RNN. With the modeling power of RNN, we achieve superior reordering accuracy without any feature engineering. In experiments, we apply the model to the task of text normalization. Compared to a strong baseline of attention-based RNN, our ITG RNN re-ordering model can reach the same reordering accuracy with only 1/10 of the training data and is 2.5x faster in decoding.

2016

Keynote Lecture 2: Neural (and other Machine Learning) Approaches to Text Normalization
Richard Sproat
Proceedings of the 13th International Conference on Natural Language Processing

Minimally Supervised Number Normalization
Kyle Gorman | Richard Sproat
Transactions of the Association for Computational Linguistics, Volume 4

We propose two models for verbalizing numbers, a key component in speech recognition and synthesis systems. The first model uses an end-to-end recurrent neural network. The second model, drawing inspiration from the linguistics literature, uses finite-state transducers constructed with a minimal amount of training data. While both models achieve near-perfect performance, the latter model can be trained using several orders of magnitude less data than the former, making it particularly useful for low-resource languages.

TTS for Low Resource Languages: A Bangla Synthesizer
Alexander Gutkin | Linne Ha | Martin Jansche | Knot Pipatsrisawat | Richard Sproat
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a text-to-speech (TTS) system designed for the dialect of Bengali spoken in Bangladesh. This work is part of an ongoing effort to address the needs of under-resourced languages. We propose a process for streamlining the bootstrapping of TTS systems for under-resourced languages. First, we use crowdsourcing to collect the data from multiple ordinary speakers, each speaker recording a small number of sentences. Second, we leverage an existing text normalization system for a related language (Hindi) to bootstrap a linguistic front-end for Bangla. Third, we employ statistical techniques to construct multi-speaker acoustic models using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) and Hidden Markov Model (HMM) approaches. We then describe our experiments, which show that the resulting TTS voices score well in terms of their perceived quality as measured by Mean Opinion Score (MOS) evaluations.

2015

Measuring idiosyncratic interests in children with autism
Masoud Rouhizadeh | Emily Prud’hommeaux | Jan van Santen | Richard Sproat
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Similarity Measures for Quantifying Restrictive and Repetitive Behavior in Conversations of Autistic Children
Masoud Rouhizadeh | Richard Sproat | Jan van Santen
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

2014

Applications of Lexicographic Semirings to Problems in Speech and Language Processing
Richard Sproat | Mahsa Yarmohammadi | Izhak Shafran | Brian Roark
Computational Linguistics, Volume 40, Issue 4 - December 2014

A Database for Measuring Linguistic Information Content
Richard Sproat | Bruno Cartoni | HyunJeong Choe | David Huynh | Linne Ha | Ravindran Rajakumar | Evelyn Wenzel-Grondie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Which languages convey the most information in a given amount of space? This is a question often asked of linguists, especially by engineers who often have some information-theoretic measure of “information” in mind, but rarely define exactly how they would measure that information. The question is, in fact, remarkably hard to answer, and many linguists consider it unanswerable. But it is a question that seems as if it ought to have an answer. If one had a database of close translations between a set of typologically diverse languages, with detailed marking of morphosyntactic and morphosemantic features, one could hope to quantify the differences between how these different languages convey information. Since no appropriate database exists, we decided to construct one. The purpose of this paper is to present our work on the database, along with some preliminary results. We plan to release the dataset once complete.

Detecting linguistic idiosyncratic interests in autism using distributional semantic models
Masoud Rouhizadeh | Emily Prud’hommeaux | Jan van Santen | Richard Sproat
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Hippocratic Abbreviation Expansion
Brian Roark | Richard Sproat
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Russian Stress Prediction using Maximum Entropy Ranking
Keith Hall | Richard Sproat
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

Annotation Tools and Knowledge Representation for a Text-To-Scene System
Bob Coyne | Alex Klapheke | Masoud Rouhizadeh | Richard Sproat | Daniel Bauer
Proceedings of COLING 2012

Robust kaomoji detection in Twitter
Steven Bedrick | Russell Beckley | Brian Roark | Richard Sproat
Proceedings of the Second Workshop on Language in Social Media

Discourse-Based Modeling for AAC
Margaret Mitchell | Richard Sproat
Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies

The OpenGrm open-source finite-state grammar software libraries
Brian Roark | Richard Sproat | Cyril Allauzen | Michael Riley | Jeffrey Sorensen | Terry Tai
Proceedings of the ACL 2012 System Demonstrations

2011

Collecting Semantic Data from Mechanical Turk for a Lexical Knowledge Resource in a Text to Picture Generating System
Masoud Rouhizadeh | Margit Bowler | Richard Sproat | Bob Coyne
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

Towards technology-assisted co-construction with communication partners
Brian Roark | Andrew Fowler | Richard Sproat | Christopher Gibbons | Melanie Fried-Oken
Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies

Lexicographic Semirings for Exact Automata Encoding of Sequence Models
Brian Roark | Richard Sproat | Izhak Shafran
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Last Words: Ancient Symbols, Computational Linguistics, and the Reviewing Practices of the General Science Journals
Richard Sproat
Computational Linguistics, Volume 36, Issue 3 - September 2010

Commentary and Discussion: Reply to Rao et al. and Lee et al.
Richard Sproat
Computational Linguistics, Volume 36, Issue 4 - December 2010

A Python Toolkit for Universal Transliteration
Ting Qian | Kristy Hollingshead | Su-youn Yoon | Kyoung-young Kim | Richard Sproat
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe ScriptTranscriber, an open source toolkit for extracting transliterations in comparable corpora from languages written in different scripts. The system includes various methods for extracting potential terms of interest from raw text, for providing guesses on the pronunciations of terms, and for comparing two strings as possible transliterations using both phonetic and temporal measures. The system works with any script in the Unicode Basic Multilingual Plane and is easily extended to include new modules. Given comparable corpora, such as newswire text, in a pair of languages that use different scripts, ScriptTranscriber provides an easy way to mine transliterations from the comparable texts. This is particularly useful for underresourced languages, where training data for transliteration may be lacking, and where it is thus hard to train good transliterators. ScriptTranscriber provides an open source package that allows for ready incorporation of more sophisticated modules, e.g., a trained transliteration model for a particular language pair. ScriptTranscriber is available as part of the nltk contrib source tree at http://code.google.com/p/nltk/.

2009

Writing Systems, Transliteration and Decipherment
Kevin Knight | Richard Sproat
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts

Named Entity Transcription with Pair n-Gram Models
Martin Jansche | Richard Sproat
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples
Suma Bhat | Richard Sproat
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Book Review: Mathematical Linguistics by András Kornai
Richard Sproat | Roxana Gîrju
Computational Linguistics, Volume 34, Number 4, December 2008

2007

Multilingual Transliteration Using Feature based Phonetic Method
Su-Youn Yoon | Kyoung-Young Kim | Richard Sproat
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

Multilingual Word Sense Discrimination: A Comparative Cross-Linguistic Study
Alla Rozovskaya | Richard Sproat
Proceedings of the Workshop on Balto-Slavonic Natural Language Processing

2006

Named Entity Transliteration with Comparable Corpora
Richard Sproat | Tao Tao | ChengXiang Zhai
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation
Tao Tao | Su-Youn Yoon | Andrew Fister | Richard Sproat | ChengXiang Zhai
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Challenges in Processing Colloquial Arabic
Alla Rozovskaya | Richard Sproat | Elabbas Benmamoun
Proceedings of the International Conference on the Challenge of Arabic for NLP/MT

Processing of Colloquial Arabic is a relatively new area of research, and a number of interesting challenges pertaining to spoken Arabic dialects arise. On the one hand, a whole continuum of Arabic dialects exists, with linguistic differences on phonological, morphological, syntactic, and lexical levels. On the other hand, there are inter-dialectal similarities that need to be explored. Furthermore, due to the scarcity of dialect-specific linguistic resources and the availability of a wide range of resources for Modern Standard Arabic (MSA), it is desirable to explore the possibility of exploiting MSA tools when working on dialects. This paper describes challenges in the processing of Colloquial Arabic in the context of language modeling for Automatic Speech Recognition. Using data from Egyptian Colloquial Arabic and MSA, we investigate the question of improving language modeling of Egyptian Arabic with MSA data and resources. As part of the project, we address the problem of linguistic variation between Egyptian Arabic and MSA. To account for differences between MSA and Colloquial Arabic, we experiment with the following techniques of data transformation: morphological simplification (stemming), lexical transductions, and syntactic transformations. While the best performing model remains the one built using only dialectal data, these techniques allow us to obtain an improvement over the baseline MSA model. More specifically, while the effect on perplexity of syntactic transformations is not very significant, stemming of the training and testing data improves the baseline perplexity of the MSA model trained on words by 51%, and lexical transductions yield an 82% perplexity reduction. Although the focus of the present work is on language modeling, we believe the findings of the study will be useful for researchers involved in other areas of processing Arabic dialects, such as parsing and machine translation.

2005

Emotions from Text: Machine Learning for Text-based Emotion Prediction
Cecilia Ovesdotter Alm | Dan Roth | Richard Sproat
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

Lattice-Based Search for Spoken Utterance Retrieval
Murat Saraclar | Richard Sproat
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

2003

The First International Chinese Word Segmentation Bakeoff
Richard Sproat | Thomas Emerson
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

2002

Creating a Finite-State Parser with Application Semantics
Owen Rambow | Srinivas Bangalore | Tahir Butt | Alexis Nasr | Richard Sproat
COLING 2002: The 17th International Conference on Computational Linguistics: Project Notes

2001

Book Reviews: Prosody: Theory and Experiment. Studies presented to Gosta Bruce
Chilin Shih | Richard Sproat
Computational Linguistics, Volume 27, Number 3, September 2001

1997

Proceedings of the 10th Research on Computational Linguistics International Conference
Keh-Jiann Chen | Chu-Ren Huang | Richard Sproat
Proceedings of the 10th Research on Computational Linguistics International Conference

1996

Issues in Text-to-Speech Conversion for Mandarin
Chilin Shih | Richard Sproat
International Journal of Computational Linguistics & Chinese Language Processing, Volume 1, Number 1, August 1996

Estimating Lexical Priors for Low-Frequency Morphologically Ambiguous Forms
Harald Baayen | Richard Sproat
Computational Linguistics, Volume 22, Number 2, June 1996

A Stochastic Finite-State Word-Segmentation Algorithm for Chinese
Richard W. Sproat | Chilin Shih | William Gale | Nancy Chang
Computational Linguistics, Volume 22, Number 3, September 1996

Compilation of Weighted Finite-State Transducers from Decision Trees
Richard Sproat | Michael Riley
34th Annual Meeting of the Association for Computational Linguistics

An Efficient Compiler for Weighted Rewrite Rules
Mehryar Mohri | Richard Sproat
34th Annual Meeting of the Association for Computational Linguistics

1994

Weighted Rational Transductions and their Application to Human Language Processing
Fernando Pereira | Michael Riley | Richard Sproat
Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994

Commentary on Bird and Klein
Richard Sproat
Computational Linguistics, Volume 20, Number 3, September 1994

A Stochastic Finite-State Word-Segmentation Algorithm for Chinese
Richard Sproat | Chilin Shih | William Gale | Nancy Chang
32nd Annual Meeting of the Association for Computational Linguistics

1991

Book Reviews: PC-KIMMO: A Two-Level Processor for Morphological Analysis
Richard Sproat
Computational Linguistics, Volume 17, Number 2, June 1991

1990

An application of statistical optimization with dynamic programming to phonemic-input-to-character conversion for Chinese
Richard Sproat
Proceedings of Rocling III Computational Linguistics Conference III

1987

Constituent-Based Morphological Parsing: A New Approach to the Problem of Word-Recognition.
Richard Sproat | Barbara Brunson
25th Annual Meeting of the Association for Computational Linguistics

Toward Treating English Nominals Correctly
Richard W. Sproat | Mark Y. Liberman
25th Annual Meeting of the Association for Computational Linguistics