Samuel Larkin

2025

pdf bib abs
Challenges in Technical Regulatory Text Variation Detection
Shriya Vaagdevi Chikati | Samuel Larkin | David Minicola | Chi-kiu Lo
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)

We present a preliminary study on the feasibility of using current natural language processing techniques to detect variations between the construction codes of different jurisdictions. We formulate the task as a sentence alignment problem and evaluate various sentence representation models for their performance in this task. Our results show that task-specific trained embeddings perform marginally better than other models, but the overall accuracy remains a challenge. We also show that domain-specific fine-tuning hurts the task performance. The results highlight the challenges of developing NLP applications for technical regulatory texts.

2024

pdf bib abs
Some Tradeoffs in Continual Learning for Parliamentary Neural Machine Translation Systems
Rebecca Knowles | Samuel Larkin | Michel Simard | Marc A Tessier | Gabriel Bernier-Colborne | Cyril Goutte | Chi-kiu Lo
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

In long-term translation projects, like Parliamentary text, there is a desire to build machine translation systems that can adapt to changes over time. We implement and examine a simple approach to continual learning for neural machine translation, exploring tradeoffs between consistency, the model’s ability to learn from incoming data, and the time a client would need to wait to obtain a newly trained translation system.

pdf bib abs
MSLC24 Submissions to the General Machine Translation Task
Samuel Larkin | Chi-Kiu Lo | Rebecca Knowles
Proceedings of the Ninth Conference on Machine Translation

The MSLC (Metric Score Landscape Challenge) submissions for English-German, English-Spanish, and Japanese-Chinese are constrained systems built using Transformer models for the purpose of better evaluating metric performance in the WMT24 Metrics Task. They are intended to be representative of the performance of systems that can be built relatively simply using constrained data and with minimal modifications to the translation training pipeline.

pdf bib abs
MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality
Rebecca Knowles | Samuel Larkin | Chi-Kiu Lo
Proceedings of the Ninth Conference on Machine Translation

In this second edition of the Metric Score Landscape Challenge (MSLC), we examine how automatic metrics for machine translation perform on a wide variety of machine translation output, ranging from very low quality systems to the types of high-quality systems submitted to the General MT shared task at WMT. We also explore metric results on specific types of data, such as empty strings, wrong- or mixed-language text, and more. We raise several alarms about inconsistencies in metric scores, some of which can be resolved by increasingly explicit instructions for metric use, while others highlight technical flaws.

2023

pdf bib abs
Terminology in Neural Machine Translation: A Case Study of the Canadian Hansard
Rebecca Knowles | Samuel Larkin | Marc Tessier | Michel Simard
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Incorporating terminology into a neural machine translation (NMT) system is a feature of interest for many users of machine translation. In this case study of English-French Canadian Parliamentary text, we examine the performance of standard NMT systems at handling terminology and consider the tradeoffs between potential performance improvements and the efforts required to maintain terminological resources specifically for NMT.

pdf bib abs
Long to reign over us: A Case Study of Machine Translation and a New Monarch
Rebecca Knowles | Samuel Larkin
Findings of the Association for Computational Linguistics: ACL 2023

Novel terminology and changes in terminology are often a challenge for machine translation systems. The passing of Queen Elizabeth II and the accession of King Charles III provide a striking example of translation shift in the real world, particularly in translation contexts that have ambiguity. Examining translation between French and English, we present a focused case-study of translations about King Charles III as produced both by publicly-available MT systems and by a neural machine translation system trained specifically on Canadian parliamentary text. We find that even in cases where human translators would have adequate context to disambiguate terms from the source language, machine translation systems do not always produce the expected output. Where we are able to analyze the training data, we note that this may represent artifacts in the data, raising important questions about machine translation updates in light of real world events.

pdf bib abs
Metric Score Landscape Challenge (MSLC23): Understanding Metrics’ Performance on a Wider Landscape of Translation Quality
Chi-kiu Lo | Samuel Larkin | Rebecca Knowles
Proceedings of the Eighth Conference on Machine Translation

The Metric Score Landscape Challenge (MSLC23) dataset aims to gain insight into metric scores on a broader/wider landscape of machine translation (MT) quality. It provides a collection of low- to medium-quality MT output on the WMT23 general task test set. Together with the high quality systems submitted to the general task, this will enable better interpretation of metric scores across a range of different levels of translation quality. With this wider range of MT quality, we also visualize and analyze metric characteristics beyond just correlation.

2021

pdf bib abs
NRC-CNRC Machine Translation Systems for the 2021 AmericasNLP Shared Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We describe the NRC-CNRC systems submitted to the AmericasNLP shared task on machine translation. We submitted systems translating from Spanish into Wixárika, Nahuatl, Rarámuri, and Guaraní. Our best neural machine translation systems used multilingual pretraining, ensembling, finetuning, training on parts of the development data, and subword regularization. We also submitted translation memory systems as a strong baseline.

pdf bib abs
Like Chalk and Cheese? On the Effects of Translationese in MT Training
Samuel Larkin | Michel Simard | Rebecca Knowles
Proceedings of Machine Translation Summit XVIII: Research Track

We revisit the topic of translation direction in the data used for training neural machine translation systems and focusing on a real-world scenario with known translation direction and imbalances in translation direction: the Canadian Hansard. According to automatic metrics and we observe that using parallel data that was produced in the “matching” translation direction (Authentic source and translationese target) improves translation quality. In cases of data imbalance in terms of translation direction and we find that tagging of translation direction can close the performance gap. We perform a human evaluation that differs slightly from the automatic metrics and but nevertheless confirms that for this French-English dataset that is known to contain high-quality translations and authentic or tagged mixed source improves over translationese source for training.

pdf bib abs
NRC-CNRC Systems for Upper Sorbian-German and Lower Sorbian-German Machine Translation 2021
Rebecca Knowles | Samuel Larkin
Proceedings of the Sixth Conference on Machine Translation

We describe our neural machine translation systems for the 2021 shared task on Unsupervised and Very Low Resource Supervised MT, translating between Upper Sorbian and German (low-resource) and between Lower Sorbian and German (unsupervised). The systems incorporated data filtering, backtranslation, BPE-dropout, ensembling, and transfer learning from high(er)-resource languages. As measured by automatic metrics, our systems showed strong performance, consistently placing first or tied for first across most metrics and translation directions.

2020

The Inuktitut language, a member of the Inuit-Yupik-Unangan language family, is spoken across Arctic Canada and noted for its morphological complexity. It is an official language of two territories, Nunavut and the Northwest Territories, and has recognition in additional regions. This paper describes a newly released sentence-aligned Inuktitut–English corpus based on the proceedings of the Legislative Assembly of Nunavut, covering sessions from April 1999 to June 2017. With approximately 1.3 million aligned sentence pairs, this is, to our knowledge, the largest parallel corpus of a polysynthetic language or an Indigenous language of the Americas released to date. The paper describes the alignment methodology used, the evaluation of the alignments, and preliminary experiments on statistical and neural machine translation (SMT and NMT) between Inuktitut and English, in both directions.

pdf bib abs
NRC Systems for the 2020 Inuktitut-English News Translation Task
Rebecca Knowles | Darlene Stewart | Samuel Larkin | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation

We describe the National Research Council of Canada (NRC) submissions for the 2020 Inuktitut-English shared task on news translation at the Fifth Conference on Machine Translation (WMT20). Our submissions consist of ensembled domain-specific finetuned transformer models, trained using the Nunavut Hansard and news data and, in the case of Inuktitut-English, backtranslated news and parliamentary data. In this work we explore challenges related to the relatively small amount of parallel data, morphological complexity, and domain shifts.

pdf bib abs
Machine Translation Reference-less Evaluation using YiSi-2 with Bilingual Mappings of Massive Multilingual Language Model
Chi-kiu Lo | Samuel Larkin
Proceedings of the Fifth Conference on Machine Translation

We present a study on using YiSi-2 with massive multilingual pretrained language models for machine translation (MT) reference-less evaluation. Aiming at finding better semantic representation for semantic MT evaluation, we first test YiSi-2 with contextual embed- dings extracted from different layers of two different pretrained models, multilingual BERT and XLM-RoBERTa. We also experiment with learning bilingual mappings that trans- form the vector subspace of the source language to be closer to that of the target language in the pretrained model to obtain more accurate cross-lingual semantic similarity representations. Our results show that YiSi-2’s correlation with human direct assessment on translation quality is greatly improved by replacing multilingual BERT with XLM-RoBERTa and projecting the source embeddings into the tar- get embedding space using a cross-lingual lin- ear projection (CLP) matrix learnt from a small development set.

pdf bib abs
NRC Systems for Low Resource German-Upper Sorbian Machine Translation 2020: Transfer Learning with Lexical Modifications
Rebecca Knowles | Samuel Larkin | Darlene Stewart | Patrick Littell
Proceedings of the Fifth Conference on Machine Translation

We describe the National Research Council of Canada (NRC) neural machine translation systems for the German-Upper Sorbian supervised track of the 2020 shared task on Unsupervised MT and Very Low Resource Supervised MT. Our models are ensembles of Transformer models, built using combinations of BPE-dropout, lexical modifications, and backtranslation.

2019

pdf bib abs
Multi-Source Transformer for Kazakh-Russian-English Neural Machine Translation
Patrick Littell | Chi-kiu Lo | Samuel Larkin | Darlene Stewart
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe the neural machine translation (NMT) system developed at the National Research Council of Canada (NRC) for the Kazakh-English news translation task of the Fourth Conference on Machine Translation (WMT19). Our submission is a multi-source NMT taking both the original Kazakh sentence and its Russian translation as input for translating into English.

2018

pdf bib
EuroGames16: Evaluating Change Detection in Online Conversation
Cyril Goutte | Yunli Wang | Fangming Liao | Zachary Zanussi | Samuel Larkin | Yuri Grinberg
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task
Patrick Littell | Samuel Larkin | Darlene Stewart | Michel Simard | Cyril Goutte | Chi-kiu Lo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g. past WMT data) as a supervisory signal to learn what a “clean” corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves an additional constraint that we not utilize the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions, and for one task—translating the European Medicines Agency corpus (Tiedemann, 2009)—scored among the best systems even in the 10M-word conditions.

pdf bib abs
Accurate semantic textual similarity for cleaning noisy parallel corpora using semantic machine translation evaluation metric: The NRC supervised submissions to the Parallel Corpus Filtering task
Chi-kiu Lo | Michel Simard | Darlene Stewart | Samuel Larkin | Cyril Goutte | Patrick Littell
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present our semantic textual similarity approach in filtering a noisy web crawled parallel corpus using YiSi—a novel semantic machine translation evaluation metric. The systems mainly based on this supervised approach perform well in the WMT18 Parallel Corpus Filtering shared task (4th place in 100-million-word evaluation, 8th place in 10-million-word evaluation, and 6th place overall, out of 48 submissions). In fact, our best performing system—NRC-yisi-bicov is one of the only four submissions ranked top 10 in both evaluations. Our submitted systems also include some initial filtering steps for scaling down the size of the test corpus and a final redundancy removal step for better semantic and token coverage of the filtered corpus. In this paper, we also describe our unsuccessful attempt in automatically synthesizing a noisy parallel development corpus for tuning the weights to combine different parallelism and fluency features.

2017

pdf bib abs
Cost Weighting for Neural Machine Translation Domain Adaptation
Boxing Chen | Colin Cherry | George Foster | Samuel Larkin
Proceedings of the First Workshop on Neural Machine Translation

In this paper, we propose a new domain adaptation technique for neural machine translation called cost weighting, which is appropriate for adaptation scenarios in which a small in-domain data set and a large general-domain data set are available. Cost weighting incorporates a domain classifier into the neural machine translation training algorithm, using features derived from the encoder representation in order to distinguish in-domain from out-of-domain data. Classifier probabilities are used to weight sentences according to their domain similarity when updating the parameters of the neural translation model. We compare cost weighting to two traditional domain adaptation techniques developed for statistical machine translation: data selection and sub-corpus weighting. Experiments on two large-data tasks show that both the traditional techniques and our novel proposal lead to significant gains, with cost weighting outperforming the traditional methods.