2024
Gramble: A Tabular Programming Language for Collaborative Linguistic Modeling
Patrick Littell | Darlene Stewart | Fineen Davis | Aidan Pine | Roland Kuhn
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce Gramble, a domain-specific programming language for linguistic parsing and generation, in the tradition of XFST, TWOLC, and Kleene. Gramble features an intuitive tabular syntax and supports live group programming, allowing community experts to participate more directly in system development without having to be programmers themselves. A cross-platform interpreter is available for Windows, macOS, and UNIX, supports collaborative programming on the web via Google Sheets, and is released open-source under the MIT license.
2021
NRC-CNRC Machine Translation Systems for the 2021 AmericasNLP Shared Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
We describe the NRC-CNRC systems submitted to the AmericasNLP shared task on machine translation. We submitted systems translating from Spanish into Wixárika, Nahuatl, Rarámuri, and Guaraní. Our best neural machine translation systems used multilingual pretraining, ensembling, finetuning, training on parts of the development data, and subword regularization. We also submitted translation memory systems as a strong baseline.
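As an illustration of the subword regularization mentioned in this abstract: SentencePiece exposes it as sampled segmentation at training time. A minimal sketch follows; the model file and hyperparameters are illustrative assumptions, not the paper's settings.

```python
# Sketch of subword regularization (Kudo, 2018) via SentencePiece sampling.
# The model file "es.model" and the hyperparameters are illustrative only.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="es.model")  # hypothetical model

sentence = "el perro corre por el campo"
# Training time: sample a fresh segmentation each time the sentence is seen.
# nbest_size=-1 samples over all segmentations; alpha sharpens the distribution.
for _ in range(3):
    print(sp.encode(sentence, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))

# Inference time: the deterministic one-best segmentation.
print(sp.encode(sentence, out_type=str))
```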
2020
The Indigenous Languages Technology project at NRC Canada: An empowerment-oriented approach to developing language software
Roland Kuhn | Fineen Davis | Alain Désilets | Eric Joanis | Anna Kazantseva | Rebecca Knowles | Patrick Littell | Delaney Lothian | Aidan Pine | Caroline Running Wolf | Eddie Santos | Darlene Stewart | Gilles Boulianne | Vishwa Gupta | Brian Maracle Owennatékha | Akwiratékha’ Martin | Christopher Cox | Marie-Odile Junker | Olivia Sammons | Delasie Torkornoo | Nathan Thanyehténhas Brinklow | Sara Child | Benoît Farley | David Huggins-Daines | Daisy Rosenblum | Heather Souter
Proceedings of the 28th International Conference on Computational Linguistics
This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals is central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including: the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family); the release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences, together with experiments on machine translation (MT) systems trained on this corpus; free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages); software for implementing text prediction and read-along audiobooks for Indigenous languages; and several other subprojects.
NRC Systems for the 2020 Inuktitut-English News Translation Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation
We describe the National Research Council of Canada (NRC) submissions for the 2020 Inuktitut-English shared task on news translation at the Fifth Conference on Machine Translation (WMT20). Our submissions consist of ensembled domain-specific finetuned transformer models, trained using the Nunavut Hansard and news data and, in the case of Inuktitut-English, backtranslated news and parliamentary data. In this work we explore challenges related to the relatively small amount of parallel data, morphological complexity, and domain shifts.
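The ensembling referred to here is typically realized at decoding time by averaging the models' per-step output distributions before the search picks a token. The sketch below shows the general mechanism only; it is a generic illustration, not the NRC implementation, which would rely on an NMT toolkit's built-in ensembling.

```python
# Generic sketch of decode-time ensembling: average the next-token
# probabilities of several models before the search step picks a token.
# `models` are callables returning (batch, vocab) logits for the current step.
import torch

def ensemble_next_token_logprobs(models, decoder_state):
    probs = [torch.softmax(m(decoder_state), dim=-1) for m in models]
    avg = torch.stack(probs).mean(dim=0)   # average in probability space
    return torch.log(avg)                  # log-probs for beam/greedy search
```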
NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications
Rebecca Knowles | Samuel Larkin | Darlene Stewart | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation
We describe the National Research Council of Canada (NRC) neural machine translation systems for the German-Upper Sorbian supervised track of the 2020 shared task on Unsupervised MT and Very Low Resource Supervised MT. Our models are ensembles of Transformer models, built using combinations of BPE-dropout, lexical modifications, and backtranslation.
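The BPE-dropout mentioned here randomly skips merge operations during segmentation, so the model sees multiple segmentations of each word during training. A sketch using subword-nmt; the codes file is a hypothetical stand-in, and 0.1 is a commonly used rate rather than the paper's setting.

```python
# Sketch of BPE-dropout (Provilkov et al., 2020) using subword-nmt.
# "de-hsb.codes" is a hypothetical merge-codes file.
import codecs
from subword_nmt.apply_bpe import BPE

with codecs.open("de-hsb.codes", encoding="utf-8") as f:
    bpe = BPE(f)

line = "die Sprachtechnologie hilft"
# dropout > 0 randomly skips merges: a different segmentation on each call
for _ in range(3):
    print(bpe.process_line(line, dropout=0.1))
# dropout=0 recovers the deterministic segmentation used at test time
print(bpe.process_line(line, dropout=0))
```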
The Nunavut Hansard Inuktitut–English Parallel Corpus 3.0 with Preliminary Machine Translation Results
Eric Joanis | Rebecca Knowles | Roland Kuhn | Samuel Larkin | Patrick Littell | Chi-kiu Lo | Darlene Stewart | Jeffrey Micher
Proceedings of the Twelfth Language Resources and Evaluation Conference
The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper describes the alignment methodology used, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.
2019
Multi-Source Transformer for Kazakh-Russian-English Neural Machine Translation
Patrick Littell | Chi-kiu Lo | Samuel Larkin | Darlene Stewart
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
We describe the neural machine translation (NMT) system developed at the National Research Council of Canada (NRC) for the Kazakh-English news translation task of the Fourth Conference on Machine Translation (WMT19). Our submission is a multi-source NMT taking both the original Kazakh sentence and its Russian translation as input for translating into English.
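One common way to realize such a multi-source model is to encode each source with its own encoder and let the decoder's cross-attention range over the concatenation of the encoder states. The toy sketch below illustrates that layout; it is an assumption for illustration, not the NRC architecture itself.

```python
# Toy sketch of one multi-source layout: one Transformer encoder per source
# language; the decoder cross-attends over the concatenated states.
import torch
import torch.nn as nn

def make_encoder(d_model=512, nhead=8, layers=2):
    layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
    return nn.TransformerEncoder(layer, layers)

class MultiSourceEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_kk = make_encoder()   # encodes the Kazakh source
        self.enc_ru = make_encoder()   # encodes its Russian translation

    def forward(self, kk_emb, ru_emb):
        # (batch, len_kk + len_ru, d_model): both sources visible to the decoder
        return torch.cat([self.enc_kk(kk_emb), self.enc_ru(ru_emb)], dim=1)
```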
2018
Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
Patrick Littell | Samuel Larkin | Darlene Stewart | Michel Simard | Cyril Goutte | Chi-kiu Lo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g., past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations, the corpus to be filtered is often the only sample of parallel text available in that language. We therefore made several unsupervised entries, setting ourselves the additional constraint of not using the clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.
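The core of the unsupervised scoring idea can be stated compactly: featurize every sentence pair, then rank pairs by their Mahalanobis distance from the corpus's own feature distribution, so that no clean reference corpus is needed. A sketch follows; the feature set is left abstract, and the paper's actual features are not reproduced here.

```python
# Sketch of unsupervised sentence-pair scoring by Mahalanobis distance.
# `features` is an (n_pairs, n_features) matrix of per-pair features
# (e.g. length ratio, lexical overlap); the choice here is illustrative.
import numpy as np

def mahalanobis_scores(features: np.ndarray) -> np.ndarray:
    mu = features.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(features, rowvar=False))  # stable inverse
    diff = features - mu
    # d_M(x) = sqrt((x - mu)^T S^-1 (x - mu)); smaller = more typical pair
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
```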
Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task
Chi-kiu Lo | Michel Simard | Darlene Stewart | Samuel Larkin | Cyril Goutte | Patrick Littell
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
We present our semantic textual similarity approach to filtering a noisy web-crawled parallel corpus using YiSi, a novel semantic machine translation evaluation metric. The systems based mainly on this supervised approach performed well in the WMT18 Parallel Corpus Filtering shared task (4th place in the 100-million-word evaluation, 8th place in the 10-million-word evaluation, and 6th place overall, out of 48 submissions). In fact, our best-performing system, NRC-yisi-bicov, is one of only four submissions ranked in the top 10 in both evaluations. Our submitted systems also include initial filtering steps to scale down the test corpus and a final redundancy-removal step for better semantic and token coverage of the filtered corpus. We also describe our unsuccessful attempt at automatically synthesizing a noisy parallel development corpus for tuning the weights that combine the different parallelism and fluency features.
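As a rough intuition for embedding-based parallelism scoring (YiSi itself is considerably more sophisticated, weighting aligned words by lexical specificity), a crude proxy scores each pair by the cosine similarity of averaged cross-lingual word embeddings:

```python
# Crude stand-in for semantic parallelism scoring: cosine similarity of
# averaged cross-lingual word embeddings. Illustrative only; not YiSi.
import numpy as np

def sent_vec(tokens, emb):
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else None

def pair_score(src_toks, tgt_toks, src_emb, tgt_emb):
    s, t = sent_vec(src_toks, src_emb), sent_vec(tgt_toks, tgt_emb)
    if s is None or t is None:
        return 0.0  # no lexical coverage: treat as non-parallel
    return float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))
```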
2017
NRC Machine Translation System for WMT 2017
Chi-kiu Lo | Boxing Chen | Colin Cherry | George Foster | Samuel Larkin | Darlene Stewart | Roland Kuhn
Proceedings of the Second Conference on Machine Translation
2016
NRC Russian-English Machine Translation System for WMT 2016
Chi-kiu Lo | Colin Cherry | George Foster | Darlene Stewart | Rabib Islam | Anna Kazantseva | Roland Kuhn
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
2014
Coarse “split and lump” bilingual language models for richer source information in SMT
Darlene Stewart | Roland Kuhn | Eric Joanis | George Foster
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
Recently, there has been interest in automatically generated word classes for improving statistical machine translation (SMT) quality: e.g., (Wuebker et al., 2013). We create new models by replacing words with word classes in features applied during decoding; we call these “coarse models”. We find that coarse versions of the bilingual language models (biLMs) of (Niehues et al., 2011) yield larger BLEU gains than the original biLMs. BiLMs provide phrase-based systems with rich contextual information from the source sentence; because they have a large number of types, they suffer from data sparsity. Niehues et al. (2011) mitigated this problem by replacing source or target words with parts of speech (POSs). We vary their approach in two ways: by clustering words on the source or target side over a range of granularities (word clustering), and by clustering the bilingual units that make up biLMs (bitoken clustering). We find that log-linear combinations of the resulting coarse biLMs with each other and with coarse LMs (LMs based on word classes) yield even higher scores than single coarse models. When we add an appealing “generic” coarse configuration chosen on English > French devtest data to four language pairs (keeping the structure fixed, but providing language-pair-specific models for each pair), BLEU gains on blind test data against strong baselines, averaged over 5 runs, are +0.80 for English > French, +0.35 for French > English, +1.0 for Arabic > English, and +0.6 for Chinese > English.
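The bitoken clustering described here can be pictured as follows: pair each target word with its aligned source word, map both to automatically induced word classes, and train an n-gram LM over the resulting class-level bitokens. A toy sketch; the cluster maps and alignment below are stand-ins for mkcls-style clusters and word-aligner output.

```python
# Toy sketch of coarse bitokens: class(target-word)_class(source-word).
# The cluster maps stand in for automatically induced word classes.
src_cls = {"the": "C7", "dog": "C2", "runs": "C5"}    # toy source classes
tgt_cls = {"le": "K7", "chien": "K2", "court": "K5"}  # toy target classes

def coarse_bitokens(src, tgt, alignment):
    """alignment: (tgt_index, src_index) pairs from a word aligner."""
    return [f"{tgt_cls.get(tgt[t], 'K0')}_{src_cls.get(src[s], 'C0')}"
            for t, s in alignment]

# "the dog runs" / "le chien court", monotone alignment:
print(coarse_bitokens(["the", "dog", "runs"], ["le", "chien", "court"],
                      [(0, 0), (1, 1), (2, 2)]))
# -> ['K7_C7', 'K2_C2', 'K5_C5']  (input to a class-level n-gram biLM)
```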
2013
Transferring markup tags in statistical machine translation: a two-stream approach
Eric Joanis | Darlene Stewart | Samuel Larkin | Roland Kuhn
Proceedings of the 2nd Workshop on Post-editing Technology and Practice