Rebecca Knowles


2021

Like Chalk and Cheese? On the Effects of Translationese in MT Training
Samuel Larkin | Michel Simard | Rebecca Knowles
Proceedings of Machine Translation Summit XVIII: Research Track

We revisit the topic of translation direction in the data used for training neural machine translation systems, focusing on a real-world scenario with known translation direction and imbalances in translation direction: the Canadian Hansard. According to automatic metrics, we observe that using parallel data that was produced in the “matching” translation direction (authentic source and translationese target) improves translation quality. In cases of data imbalance in terms of translation direction, we find that tagging of translation direction can close the performance gap. We perform a human evaluation that differs slightly from the automatic metrics, but nevertheless confirms that for this French-English dataset that is known to contain high-quality translations, authentic or tagged mixed source improves over translationese source for training.

NRC-CNRC Machine Translation Systems for the 2021 AmericasNLP Shared Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We describe the NRC-CNRC systems submitted to the AmericasNLP shared task on machine translation. We submitted systems translating from Spanish into Wixárika, Nahuatl, Rarámuri, and Guaraní. Our best neural machine translation systems used multilingual pretraining, ensembling, finetuning, training on parts of the development data, and subword regularization. We also submitted translation memory systems as a strong baseline.

2020

The Indigenous Languages Technology project at NRC Canada: An empowerment-oriented approach to developing language software
Roland Kuhn | Fineen Davis | Alain Désilets | Eric Joanis | Anna Kazantseva | Rebecca Knowles | Patrick Littell | Delaney Lothian | Aidan Pine | Caroline Running Wolf | Eddie Santos | Darlene Stewart | Gilles Boulianne | Vishwa Gupta | Brian Maracle Owennatékha | Akwiratékha’ Martin | Christopher Cox | Marie-Odile Junker | Olivia Sammons | Delasie Torkornoo | Nathan Thanyehténhas Brinklow | Sara Child | Benoît Farley | David Huggins-Daines | Daisy Rosenblum | Heather Souter
Proceedings of the 28th International Conference on Computational Linguistics

This paper surveys the first, three-year phase of a project at the National Research Council of Canada that is developing software to assist Indigenous communities in Canada in preserving their languages and extending their use. The project aimed to work within the empowerment paradigm, where collaboration with communities and fulfillment of their goals is central. Since many of the technologies we developed were in response to community needs, the project ended up as a collection of diverse subprojects, including the creation of a sophisticated framework for building verb conjugators for highly inflectional polysynthetic languages (such as Kanyen’kéha, in the Iroquoian language family), release of what is probably the largest available corpus of sentences in a polysynthetic language (Inuktut) aligned with English sentences and experiments with machine translation (MT) systems trained on this corpus, free online services based on automatic speech recognition (ASR) for easing the transcription bottleneck for recordings of speech in Indigenous languages (and other languages), software for implementing text prediction and read-along audiobooks for Indigenous languages, and several other subprojects.

The Nunavut Hansard Inuktitut–English Parallel Corpus 3.0 with Preliminary Machine Translation Results
Eric Joanis | Rebecca Knowles | Roland Kuhn | Samuel Larkin | Patrick Littell | Chi-kiu Lo | Darlene Stewart | Jeffrey Micher
Proceedings of the 12th Language Resources and Evaluation Conference

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper describes the alignment methodology used, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.

NRC Systems for the 2020 Inuktitut-English News Translation Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation

We describe the National Research Council of Canada (NRC) submissions for the 2020 Inuktitut-English shared task on news translation at the Fifth Conference on Machine Translation (WMT20). Our submissions consist of ensembled domain-specific finetuned transformer models, trained using the Nunavut Hansard and news data and, in the case of Inuktitut-English, backtranslated news and parliamentary data. In this work we explore challenges related to the relatively small amount of parallel data, morphological complexity, and domain shifts.

NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications
Rebecca Knowles | Samuel Larkin | Darlene Stewart | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation

We describe the National Research Council of Canada (NRC) neural machine translation systems for the German-Upper Sorbian supervised track of the 2020 shared task on Unsupervised MT and Very Low Resource Supervised MT. Our models are ensembles of Transformer models, built using combinations of BPE-dropout, lexical modifications, and backtranslation.

A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
Graham Neubig | Shruti Rijhwani | Alexis Palmer | Jordan MacKenzie | Hilaria Cruz | Xinjian Li | Matthew Lee | Aditi Chaudhary | Luke Gessler | Steven Abney | Shirley Anugrah Hayati | Antonios Anastasopoulos | Olga Zamaraeva | Emily Prud’hommeaux | Jennette Child | Sara Child | Rebecca Knowles | Sarah Moeller | Jeffrey Micher | Yiyuan Li | Sydney Zink | Mengzhou Xia | Roshan S Sharma | Patrick Littell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

2019

HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation
Brian Thompson | Rebecca Knowles | Xuan Zhang | Huda Khayrallah | Kevin Duh | Philipp Koehn
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Bilingual lexicons are valuable resources used by professional human translators. While these resources can be easily incorporated in statistical machine translation, it is unclear how to best do so in the neural framework. In this work, we present the HABLex dataset, designed to test methods for bilingual lexicon integration into neural machine translation. Our data consists of human-generated alignments of words and phrases in machine translation test sets in three language pairs (Russian-English, Chinese-English, and Korean-English), resulting in clean bilingual lexicons which are well matched to the reference. We also present two simple baselines - constrained decoding and continued training - and an improvement to continued training to address overfitting.

2018

Lightweight Word-Level Confidence Estimation for Neural Interactive Translation Prediction
Rebecca Knowles | Philipp Koehn
Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing

A Comparison of Machine Translation Paradigms for Use in Black-Box Fuzzy-Match Repair
Rebecca Knowles | John Ortega | Philipp Koehn
Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing

Document-Level Adaptation for Neural Machine Translation
Sachith Sri Ram Kothur | Rebecca Knowles | Philipp Koehn
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

It is common practice to adapt machine translation systems to novel domains, but even a well-adapted system may be able to perform better on a particular document if it were to learn from a translator’s corrections within the document itself. We focus on adaptation within a single document – appropriate for an interactive translation scenario where a model adapts to a human translator’s input over the course of a document. We propose two methods: single-sentence adaptation (which performs online adaptation one sentence at a time) and dictionary adaptation (which specifically addresses the issue of translating novel words). Combining the two models results in improvements over both approaches individually, and over baseline systems, even on short documents. On WMT news test data, we observe an improvement of +1.8 BLEU points and +23.3% novel word translation accuracy, and on EMEA data (descriptions of medications), we observe an improvement of +2.7 BLEU points and +49.2% novel word translation accuracy.

Context and Copying in Neural Machine Translation
Rebecca Knowles | Philipp Koehn
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Neural machine translation systems with subword vocabularies are capable of translating or copying unknown words. In this work, we show that they learn to copy words based on both the context in which the words appear as well as features of the words themselves. In contexts that are particularly copy-prone, they even copy words that they have already learned they should translate. We examine the influence of context and subword features on this and other types of copying behavior.

2017

Six Challenges for Neural Machine Translation
Philipp Koehn | Rebecca Knowles
Proceedings of the First Workshop on Neural Machine Translation

We explore six challenges for neural machine translation: domain mismatch, amount of training data, rare words, long sentences, word alignment, and beam search. We show both deficiencies and improvements over the quality of phrase-based statistical machine translation.

A Rich Morphological Tagger for English: Exploring the Cross-Linguistic Tradeoff Between Morphology and Syntax
Christo Kirov | John Sylak-Glassman | Rebecca Knowles | Ryan Cotterell | Matt Post
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

A traditional claim in linguistics is that all human languages are equally expressive—able to convey the same wide range of meanings. Morphologically rich languages, such as Czech, rely on overt inflectional and derivational morphology to convey many semantic distinctions. Languages with comparatively limited morphology, such as English, should be able to accomplish the same using a combination of syntactic and contextual cues. We capitalize on this idea by training a tagger for English that uses syntactic features obtained by automatic parsing to recover complex morphological tags projected from Czech. The high accuracy of the resulting model provides quantitative confirmation of the underlying linguistic hypothesis of equal expressivity, and bodes well for future improvements in downstream HLT tasks including machine translation.

2016

User Modeling in Language Learning with Macaronic Texts
Adithya Renduchintala | Rebecca Knowles | Philipp Koehn | Jason Eisner
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Creating Interactive Macaronic Interfaces for Language Learning
Adithya Renduchintala | Rebecca Knowles | Philipp Koehn | Jason Eisner
Proceedings of ACL-2016 System Demonstrations

Demographer: Extremely Simple Name Demographics
Rebecca Knowles | Josh Carroll | Mark Dredze
Proceedings of the First Workshop on NLP and Computational Social Science

Analyzing Learner Understanding of Novel L2 Vocabulary
Rebecca Knowles | Adithya Renduchintala | Philipp Koehn | Jason Eisner
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

2014

I’m a Belieber: Social Roles via Self-identification and Conceptual Attributes
Charley Beller | Rebecca Knowles | Craig Harman | Shane Bergsma | Margaret Mitchell | Benjamin Van Durme
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Topic Models and Metadata for Visualizing Text Corpora
Justin Snyder | Rebecca Knowles | Mark Dredze | Matthew Gormley | Travis Wolfe
Proceedings of the 2013 NAACL HLT Demonstration Session