Cyril Goutte

Also published as: C. Goutte

2024

Some Tradeoffs in Continual Learning for Parliamentary Neural Machine Translation Systems
Rebecca Knowles | Samuel Larkin | Michel Simard | Marc A Tessier | Gabriel Bernier-Colborne | Cyril Goutte | Chi-kiu Lo
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

In long-term translation projects, like Parliamentary text, there is a desire to build machine translation systems that can adapt to changes over time. We implement and examine a simple approach to continual learning for neural machine translation, exploring tradeoffs between consistency, the model’s ability to learn from incoming data, and the time a client would need to wait to obtain a newly trained translation system.

2023

pdf bib abs

Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis
Gabriel Bernier-colborne | Cyril Goutte | Serge Leger
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

We argue that dialect identification should be treated as a multi-label classification problem rather than the single-class setting prevalent in existing collections and evaluations. In order to avoid extensive human re-labelling of the data, we propose an analysis of ambiguous near-duplicates in an existing collection covering four variants of French.We show how this analysis helps us provide multiple labels for a significant subset of the original data, therefore enriching the annotation with minimal human intervention. The resulting data can then be used to train dialect identifiers in a multi-label setting. Experimental results show that on the enriched dataset, the multi-label classifier produces similar accuracy to the single-label classifier on test cases that are unambiguous (single label), but it increases the macro-averaged F1-score by 0.225 absolute (71% relative gain) on ambiguous texts with multiple labels. On the original data, gains on the ambiguous test cases are smaller but still considerable (+0.077 absolute, 20% relative gain), and accuracy on non-ambiguous test cases is again similar in this case. This supports our thesis that modelling dialect identification as a multi-label problem potentially has a positive impact.

pdf bib abs

Beyond Correlation: Making Sense of the Score Differences of New MT Evaluation Metrics
Chi-kiu Lo | Rebecca Knowles | Cyril Goutte
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

While many new automatic metrics for machine translation evaluation have been proposed in recent years, BLEU scores are still used as the primary metric in the vast majority of MT research papers. There are many reasons that researchers may be reluctant to switch to new metrics, from external pressures (reviewers, prior work) to the ease of use of metric toolkits. Another reason is a lack of intuition about the meaning of novel metric scores. In this work, we examine “rules of thumb” about metric score differences and how they do (and do not) correspond to human judgments of statistically significant differences between systems. In particular, we show that common rules of thumb about BLEU score differences do not in fact guarantee that human annotators will find significant differences between systems. We also show ways in which these rules of thumb fail to generalize across translation directions or domains.

2022

pdf bib abs

Transfer Learning Improves French Cross-Domain Dialect Identification: NRC @ VarDial 2022
Gabriel Bernier-Colborne | Serge Leger | Cyril Goutte
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

We describe the systems developed by the National Research Council Canada for the French Cross-Domain Dialect Identification shared task at the 2022 VarDial evaluation campaign. We evaluated two different approaches to this task: SVM and probabilistic classifiers exploiting n-grams as features, and trained from scratch on the data provided; and a pre-trained French language model, CamemBERT, that we fine-tuned on the dialect identification task. The latter method turned out to improve the macro-F1 score on the test set from 0.344 to 0.430 (25% increase), which indicates that transfer learning can be helpful for dialect identification.

pdf bib abs

Refining an Almost Clean Translation Memory Helps Machine Translation
Shivendra Bhardwa | David Alfonso-Hermelo | Philippe Langlais | Gabriel Bernier-Colborne | Cyril Goutte | Michel Simard
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

While recent studies have been dedicated to cleaning very noisy parallel corpora to improve Machine Translation training, we focus in this work on filtering a large and mostly clean Translation Memory. This problem of practical interest has not received much consideration from the community, in contrast with, for example, filtering large web-mined parallel corpora. We experiment with an extensive, multi-domain proprietary Translation Memory and compare five approaches involving deep-, feature-, and heuristic-based solutions. We propose two ways of evaluating this task, manual annotation and resulting Machine Translation quality. We report significant gains over a state-of-the-art, off-the-shelf cleaning system, using two MT engines.

Cyril Goutte

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2010

2009

2007

2005

2004

2003

2002

Co-authors

Venues