Patrick Littell


pdf bib
ReadAlong Studio Web Interface for Digital Interactive Storytelling
Aidan Pine | David Huggins-Daines | Eric Joanis | Patrick Littell | Marc Tessier | Delasie Torkornoo | Rebecca Knowles | Roland Kuhn | Delaney Lothian
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

We develop an interactive web-based user interface for performing textspeech alignment and creating digital interactive “read-along audio books that highlight words as they are spoken and allow users to replay individual words when clicked. We build on an existing Python library for zero-shot multilingual textspeech alignment (Littell et al., 2022), extend it by exposing its functionality through a RESTful API, and rewrite the underlying speech recognition engine to run in the browser. The ReadAlong Studio Web App is open-source, user-friendly, prioritizes privacy and data sovereignty, allows for a variety of standard export formats, and is designed to work for the majority of the world’s languages.


pdf bib
Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization
Aidan Pine | Dan Wells | Nathan Brinklow | Patrick Littell | Korin Richmond
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper describes the motivation and development of speech synthesis systems for the purposes of language revitalization. By building speech synthesis systems for three Indigenous languages spoken in Canada, Kanien’kéha, Gitksan & SENĆOŦEN, we re-evaluate the question of how much data is required to build low-resource speech synthesis systems featuring state-of-the-art neural models. For example, preliminary results with English data show that a FastSpeech2 model trained with 1 hour of training data can produce speech with comparable naturalness to a Tacotron2 model trained with 10 hours of data. Finally, we motivate future research in evaluation and classroom integration in the field of speech synthesis for language revitalization.

pdf bib
Translation Memories as Baselines for Low-Resource Machine Translation
Rebecca Knowles | Patrick Littell
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Low-resource machine translation research often requires building baselines to benchmark estimates of progress in translation quality. Neural and statistical phrase-based systems are often used with out-of-the-box settings to build these initial baselines before analyzing more sophisticated approaches, implicitly comparing the first machine translation system to the absence of any translation assistance. We argue that this approach overlooks a basic resource: if you have parallel text, you have a translation memory. In this work, we show that using available text as a translation memory baseline against which to compare machine translation systems is simple, effective, and can shed light on additional translation challenges.

pdf bib
ReadAlong Studio: Practical Zero-Shot Text-Speech Alignment for Indigenous Language Audiobooks
Patrick Littell | Eric Joanis | Aidan Pine | Marc Tessier | David Huggins Daines | Delasie Torkornoo
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

While the alignment of audio recordings and text (often termed “forced alignment”) is often treated as a solved problem, in practice the process of adapting an alignment system to a new, under-resourced language comes with significant challenges, requiring experience and expertise that many outside of the speech community lack. This puts otherwise “solvable” problems, like the alignment of Indigenous language audiobooks, out of reach for many real-world Indigenous language organizations. In this paper, we detail ReadAlong Studio, a suite of tools for creating and visualizing aligned audiobooks, including educational features like time-aligned highlighting, playing single words in isolation, and variable-speed playback. It is intended to be accessible to creators without an extensive background in speech or NLP, by automating or making optional many of the specialist steps in an alignment pipeline. It is well documented at a beginner-technologist level, has already been adapted to 30 languages, and can work out-of-the-box on many more languages without adaptation.


pdf bib
NRC-CNRC Machine Translation Systems for the 2021 AmericasNLP Shared Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We describe the NRC-CNRC systems submitted to the AmericasNLP shared task on machine translation. We submitted systems translating from Spanish into Wixárika, Nahuatl, Rarámuri, and Guaraní. Our best neural machine translation systems used multilingual pretraining, ensembling, finetuning, training on parts of the development data, and subword regularization. We also submitted translation memory systems as a strong baseline.


pdf bib
NRC Systems for the 2020 Inuktitut-English News Translation Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation

We describe the National Research Council of Canada (NRC) submissions for the 2020 Inuktitut-English shared task on news translation at the Fifth Conference on Machine Translation (WMT20). Our submissions consist of ensembled domain-specific finetuned transformer models, trained using the Nunavut Hansard and news data and, in the case of Inuktitut-English, backtranslated news and parliamentary data. In this work we explore challenges related to the relatively small amount of parallel data, morphological complexity, and domain shifts.

pdf bib
NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications
Rebecca Knowles | Samuel Larkin | Darlene Stewart | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation

We describe the National Research Council of Canada (NRC) neural machine translation systems for the German-Upper Sorbian supervised track of the 2020 shared task on Unsupervised MT and Very Low Resource Supervised MT. Our models are ensembles of Transformer models, built using combinations of BPE-dropout, lexical modifications, and backtranslation.

pdf bib
The Nunavut Hansard Inuktitut–English Parallel Corpus 3.0 with Preliminary Machine Translation Results
Eric Joanis | Rebecca Knowles | Roland Kuhn | Samuel Larkin | Patrick Littell | Chi-kiu Lo | Darlene Stewart | Jeffrey Micher
Proceedings of the Twelfth Language Resources and Evaluation Conference

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper describes the alignment methodology used, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.

pdf bib
AlloVera: A Multilingual Allophone Database
David R. Mortensen | Xinjian Li | Patrick Littell | Alexis Michaud | Shruti Rijhwani | Antonios Anastasopoulos | Alan W Black | Florian Metze | Graham Neubig
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription. AlloVera allows the training of speech recognition models that output phonetic transcriptions in the International Phonetic Alphabet (IPA), regardless of the input language. We show that a “universal” allophone model, Allosaurus, built with AlloVera, outperforms “universal” phonemic models and language-specific models on a speech-transcription task. We explore the implications of this technology (and related technologies) for the documentation of endangered and minority languages. We further explore other applications for which AlloVera will be suitable as it grows, including phonological typology.

pdf bib
The Indigenous Languages Technology project at NRC Canada: An empowerment-oriented approach to developing language software
Roland Kuhn | Fineen Davis | Alain Désilets | Eric Joanis | Anna Kazantseva | Rebecca Knowles | Patrick Littell | Delaney Lothian | Aidan Pine | Caroline Running Wolf | Eddie Santos | Darlene Stewart | Gilles Boulianne | Vishwa Gupta | Brian Maracle Owennatékha | Akwiratékha’ Martin | Christopher Cox | Marie-Odile Junker | Olivia Sammons | Delasie Torkornoo | Nathan Thanyehténhas Brinklow | Sara Child | Benoît Farley | David Huggins-Daines | Daisy Rosenblum | Heather Souter
Proceedings of the 28th International Conference on Computational Linguistics

This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals is central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family), release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences and experiments with machine translation (MT) systems trained on this corpus, free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages), software for implementing text prediction and read-along audiobooks for Indigenous languages, and several other subprojects.

pdf bib
A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
Graham Neubig | Shruti Rijhwani | Alexis Palmer | Jordan MacKenzie | Hilaria Cruz | Xinjian Li | Matthew Lee | Aditi Chaudhary | Luke Gessler | Steven Abney | Shirley Anugrah Hayati | Antonios Anastasopoulos | Olga Zamaraeva | Emily Prud’hommeaux | Jennette Child | Sara Child | Rebecca Knowles | Sarah Moeller | Jeffrey Micher | Yiyuan Li | Sydney Zink | Mengzhou Xia | Roshan S Sharma | Patrick Littell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.


pdf bib
Multi-Source Transformer for Kazakh-Russian-English Neural Machine Translation
Patrick Littell | Chi-kiu Lo | Samuel Larkin | Darlene Stewart
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe the neural machine translation (NMT) system developed at the National Research Council of Canada (NRC) for the Kazakh-English news translation task of the Fourth Conference on Machine Translation (WMT19). Our submission is a multi-source NMT taking both the original Kazakh sentence and its Russian translation as input for translating into English.

pdf bib
Choosing Transfer Languages for Cross-Lingual Learning
Yu-Hsiang Lin | Chian-Yu Chen | Jean Lee | Zirui Li | Yuyan Zhang | Mengzhou Xia | Shruti Rijhwani | Junxian He | Zhisong Zhang | Xuezhe Ma | Antonios Anastasopoulos | Patrick Littell | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Cross-lingual transfer, where a high-resource transfer language is used to improve the accuracy of a low-resource task language, is now an invaluable tool for improving performance of natural language processing (NLP) on low-resource languages. However, given a particular task language, it is not clear which language to transfer from, and the standard strategy is to select languages based on ad hoc criteria, usually the intuition of the experimenter. Since a large number of features contribute to the success of cross-lingual transfer (including phylogenetic similarity, typological properties, lexical overlap, or size of available data), even the most enlightened experimenter rarely considers all these factors for the particular task at hand. In this paper, we consider this task of automatically selecting optimal transfer languages as a ranking problem, and build models that consider the aforementioned features to perform this prediction. In experiments on representative NLP tasks, we demonstrate that our model predicts good transfer languages much better than ad hoc baselines considering single features in isolation, and glean insights on what features are most informative for each different NLP tasks, which may inform future ad hoc selection even without use of our method.


pdf bib
Finite-state morphology for Kwak’wala: A phonological approach
Patrick Littell
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

This paper presents the phonological layer of a Kwak’wala finite-state morphological transducer, using the phonological hypotheses of Lincoln and Rath (1986) and the lenient composition operation of Karttunen (1998) to mediate the complicated relationship between underlying and surface forms. The resulting system decomposes the wide variety of surface forms in such a way that the morphological layer can be specified using unique and largely concatenative morphemes.

pdf bib
Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
Patrick Littell | Samuel Larkin | Darlene Stewart | Michel Simard | Cyril Goutte | Chi-kiu Lo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.

pdf bib
Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task
Chi-kiu Lo | Michel Simard | Darlene Stewart | Samuel Larkin | Cyril Goutte | Patrick Littell
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present our semantic textual similarity approach in filtering a noisy web crawled parallel corpus using YiSi—a novel semantic machine translation evaluation metric. The systems mainly based on this supervised approach perform well in the WMT18 Parallel Corpus Filtering shared task (4th place in 100-million-word evaluation, 8th place in 10-million-word evaluation, and 6th place overall, out of 48 submissions). In fact, our best performing system—NRC-yisi-bicov is one of the only four submissions ranked top 10 in both evaluations. Our submitted systems also include some initial filtering steps for scaling down the size of the test corpus and a final redundancy removal step for better semantic and token coverage of the filtered corpus. In this paper, we also describe our unsuccessful attempt in automatically synthesizing a noisy parallel development corpus for tuning the weights to combine different parallelism and fluency features.

pdf bib
Indigenous language technologies in Canada: Assessment, challenges, and successes
Patrick Littell | Anna Kazantseva | Roland Kuhn | Aidan Pine | Antti Arppe | Christopher Cox | Marie-Odile Junker
Proceedings of the 27th International Conference on Computational Linguistics

In this article, we discuss which text, speech, and image technologies have been developed, and would be feasible to develop, for the approximately 60 Indigenous languages spoken in Canada. In particular, we concentrate on technologies that may be feasible to develop for most or all of these languages, not just those that may be feasible for the few most-resourced of these. We assess past achievements and consider future horizons for Indigenous language transliteration, text prediction, spell-checking, approximate search, machine translation, speech recognition, speaker diarization, speech synthesis, optical character recognition, and computer-aided language learning.

pdf bib
Epitran: Precision G2P for Many Languages
David R. Mortensen | Siddharth Dalmia | Patrick Littell
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Parser combinators for Tigrinya and Oromo morphology
Patrick Littell | Tom McCoy | Na-Rae Han | Shruti Rijhwani | Zaid Sheikh | David Mortensen | Teruko Mitamura | Lori Levin
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


pdf bib
Learning Language Representations for Typology Prediction
Chaitanya Malaviya | Graham Neubig | Patrick Littell
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

One central mystery of neural NLP is what neural models “know” about their subject matter. When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages? Can this knowledge be extracted from the system to fill holes in human scientific knowledge? Existing typological databases contain relatively full feature specifications for only a few hundred languages. Exploiting the existence of parallel texts in more than a thousand languages, we build a massive many-to-one NMT system from 1017 languages into English, and use this to predict information missing from typological databases. Experiments show that the proposed method is able to infer not only syntactic, but also phonological and phonetic inventory features, and improves over a baseline that has access to information about the languages geographic and phylogenetic neighbors.

pdf bib
STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
Gina-Anne Levow | Emily M. Bender | Patrick Littell | Kristen Howell | Shobhana Chelliah | Joshua Crowgey | Dan Garrette | Jeff Good | Sharon Hargus | David Inman | Michael Maxwell | Michael Tjalve | Fei Xia
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

pdf bib
Waldayu and Waldayu Mobile: Modern digital dictionary interfaces for endangered languages
Patrick Littell | Aidan Pine | Henry Davis
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

pdf bib
URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors
Patrick Littell | David R. Mortensen | Ke Lin | Katherine Kairis | Carlisle Turner | Lori Levin
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We introduce the URIEL knowledge base for massively multilingual NLP and the lang2vec utility, which provides information-rich vector identifications of languages drawn from typological, geographical, and phylogenetic databases and normalized to have straightforward and consistent formats, naming, and semantics. The goal of URIEL and lang2vec is to enable multilingual NLP, especially on less-resourced languages and make possible types of experiments (especially but not exclusively related to NLP tasks) that are otherwise difficult or impossible due to the sparsity and incommensurability of the data sources. lang2vec vectors have been shown to reduce perplexity in multilingual language modeling, when compared to one-hot language identification vectors.


pdf bib
Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning
Yulia Tsvetkov | Sunayana Sitaram | Manaal Faruqui | Guillaume Lample | Patrick Littell | David Mortensen | Alan W Black | Lori Levin | Chris Dyer
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik
Patrick Littell | David R. Mortensen | Kartik Goyal | Chris Dyer | Lori Levin
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In Sorani Kurdish, one of the most useful orthographic features in named-entity recognition – capitalization – is absent, as the language’s Perso-Arabic script does not make a distinction between uppercase and lowercase letters. We describe a system for deriving an inferred capitalization value from closely related languages by phonological similarity, and illustrate the system using several related Western Iranian languages.

pdf bib
The Role of Context in Neural Morphological Disambiguation
Qinlan Shen | Daniel Clothiaux | Emily Tagtow | Patrick Littell | Chris Dyer
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Languages with rich morphology often introduce sparsity in language processing tasks. While morphological analyzers can reduce this sparsity by providing morpheme-level analyses for words, they will often introduce ambiguity by returning multiple analyses for the same surface form. The problem of disambiguating between these morphological parses is further complicated by the fact that a correct parse for a word is not only be dependent on the surface form but also on other words in its context. In this paper, we present a language-agnostic approach to morphological disambiguation. We address the problem of using context in morphological disambiguation by presenting several LSTM-based neural architectures that encode long-range surface-level and analysis-level contextual dependencies. We applied our approach to Turkish, Russian, and Arabic to compare effectiveness across languages, matching state-of-the-art results in two of the three languages. Our results also demonstrate that while context plays a role in learning how to disambiguate, the type and amount of context needed varies between languages.

pdf bib
Named Entity Recognition for Linguistic Rapid Response in Low-Resource Languages: Sorani Kurdish and Tajik
Patrick Littell | Kartik Goyal | David R. Mortensen | Alexa Little | Chris Dyer | Lori Levin
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper describes our construction of named-entity recognition (NER) systems in two Western Iranian languages, Sorani Kurdish and Tajik, as a part of a pilot study of “Linguistic Rapid Response” to potential emergency humanitarian relief situations. In the absence of large annotated corpora, parallel corpora, treebanks, bilingual lexica, etc., we found the following to be effective: exploiting distributional regularities in monolingual data, projecting information across closely related languages, and utilizing human linguist judgments. We show promising results on both a four-month exercise in Sorani and a two-day exercise in Tajik, achieved with minimal annotation costs.

pdf bib
PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors
David R. Mortensen | Patrick Littell | Akash Bharadwaj | Kartik Goyal | Chris Dyer | Lori Levin
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper contributes to a growing body of evidence that—when coupled with appropriate machine-learning techniques–linguistically motivated, information-rich representations can outperform one-hot encodings of linguistic data. In particular, we show that phonological features outperform character-based models. PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. We show that this database boosts performance in various NER-related tasks. Phonologically aware, neural CRF models built on PanPhon features are able to perform better on monolingual Spanish and Turkish NER tasks that character-based models. They have also been shown to work well in transfer models (as between Uzbek and Turkish). PanPhon features also contribute measurably to Orthography-to-IPA conversion tasks.


pdf bib
Morphological parsing of Swahili using crowdsourced lexical resources
Patrick Littell | Kaitlyn Price | Lori Levin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We describe a morphological analyzer for the Swahili language, written in an extension of XFST/LEXC intended for the easy declaration of morphophonological patterns and importation of lexical resources. Our analyzer was supplemented extensively with data from the Kamusi Project (, a user-contributed multilingual dictionary. Making use of this resource allowed us to achieve wide lexical coverage quickly, but the heterogeneous nature of user-contributed content also poses some challenges when adapting it for use in an expert system.


pdf bib
Introducing Computational Concepts in a Linguistics Olympiad
Patrick Littell | Lori Levin | Jason Eisner | Dragomir Radev
Proceedings of the Fourth Workshop on Teaching NLP and CL