Franck Burlot


2019

This paper describes Lingua Custodia’s submission to the WMT’19 news shared task for German-to-French on the topic of the EU elections. We report experiments on adapting the terminology of a machine translation system to a specific topic, with the aim of providing more accurate translations of specific entities such as political parties and person names, given that the shared task provided no in-domain parallel training data on the restricted topic. Our primary submission to the shared task uses backtranslation generated with a type of decoding that allows constraints to be inserted in the output, in order to guarantee the correct translation of specific terms that are not necessarily observed in the data.

2018

The new state of the art in machine translation (MT) relies on neural methods, which differ profoundly from the methods used previously. Standard automatic metrics are poorly suited to capturing the nature of the observed qualitative leap. This paper proposes an evaluation protocol for English-to-French translation that focuses specifically on the morphological competence of MT systems, by studying their performance on various grammatical phenomena.
Neural Machine Translation (MT) has radically changed the way systems are developed. A major difference with the previous generation (Phrase-Based MT) is the way monolingual target data, which often abounds, is used in these two paradigms. While Phrase-Based MT can seamlessly integrate very large language models trained on billions of sentences, the best option for Neural MT developers seems to be the generation of artificial parallel data through back-translation - a technique that fails to fully take advantage of existing datasets. In this paper, we conduct a systematic study of back-translation, comparing alternative uses of monolingual data, as well as multiple data generation procedures. Our findings confirm that back-translation is very effective and give new explanations as to why this is the case. We also introduce new data simulation techniques that are almost as effective, yet much cheaper to implement.
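As a rough illustration of the back-translation idea summarized in the abstract above, the sketch below pairs genuine target-language sentences with synthetic sources produced by a target-to-source system. It is not the paper's implementation: `back_translate`, `toy_reverse_model`, and the word-for-word lexicon are hypothetical stand-ins, and a real setup would use a trained reverse MT model.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each genuine target sentence with a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in target_monolingual]


def toy_reverse_model(sentence: str) -> str:
    # Hypothetical stand-in for a trained French-to-English MT system:
    # a word-for-word lookup that leaves unknown words unchanged.
    lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())


# Natural parallel data: (source, target) pairs.
natural = [("the dog barks", "le chien aboie")]

# Synthetic parallel data: the target side is genuine, the source side is generated.
synthetic = back_translate(["le chat dort"], toy_reverse_model)

# The forward (source-to-target) system would then be trained on both sets.
training_data = natural + synthetic
```

The point of the design is that the target side of every synthetic pair is human-written text, so the forward system still learns to produce fluent output even when the generated source side is noisy.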
Progress in the quality of machine translation output calls for new automatic evaluation procedures and metrics. In this paper, we extend the Morpheval protocol introduced by Burlot and Yvon (2017) for the English-to-Czech and English-to-Latvian translation directions to three additional language pairs, and report its use to analyze the results of WMT 2018’s participants for these language pairs. Considering additional, typologically varied source and target languages also enables us to draw some generalizations regarding this morphology-oriented evaluation procedure.

2017

When translating from a morphologically rich language into English, source word forms carry grammatical markers that can be considered redundant with respect to English, causing formal variability that hampers the estimation of probabilistic models. A well-documented way to mitigate this problem is to remove irrelevant information from the source by normalizing it. This preprocessing is usually performed deterministically, using handcrafted rules. Such normalization is inherently suboptimal and must be adapted for each language pair. In this paper, we present a simple method to automatically search for an optimal normalization of the source morphology with respect to the target language, and show that it can improve machine translation.

2016

This paper describes a two-step machine translation system that addresses the issue of translating into a morphologically rich language (English to Czech) by performing the translation and the generation of target morphology separately. The first step consists of translating from English into a normalized version of Czech, where some morphological information has been removed. The second step retrieves this information and re-inflects the normalized output, turning it into fully inflected Czech. We introduce different setups for the second step and evaluate the quality of their predictions over different MT systems trained on different amounts of parallel and monolingual data. We also report ways to adapt to different data sizes, which improves the translation in low-resource conditions, as well as when large training data is available.
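The second step of the two-step approach described above can be pictured with a toy example. The sketch below takes a normalized Czech output (lemmas with part-of-speech tags), predicts a case for each noun, and looks up the inflected form. The names `INFLECTIONS` and `reinflect`, the tiny inflection table, and the position-based case rule are all assumptions made for illustration; the paper's actual re-inflection setups are not reproduced here.

```python
from typing import Dict, List, Tuple

# Toy inflection table (assumption): (lemma, case) -> surface form.
INFLECTIONS: Dict[Tuple[str, str], str] = {
    ("pes", "nom"): "pes",
    ("pes", "acc"): "psa",
    ("kočka", "nom"): "kočka",
    ("kočka", "acc"): "kočku",
}


def reinflect(normalized: List[Tuple[str, str]]) -> str:
    """Turn normalized (lemma, POS) tokens into an inflected sentence."""
    output = []
    nouns_seen = 0
    for lemma, pos in normalized:
        if pos == "NOUN":
            # Hypothetical rule standing in for a learned predictor:
            # the first noun is the subject (nominative), later ones are objects.
            case = "nom" if nouns_seen == 0 else "acc"
            nouns_seen += 1
            output.append(INFLECTIONS[(lemma, case)])
        else:
            # Other tokens are passed through unchanged in this sketch.
            output.append(lemma)
    return " ".join(output)


# Step 1 (not shown) would translate English into this normalized Czech.
normalized_output = [("pes", "NOUN"), ("vidí", "VERB"), ("kočka", "NOUN")]
sentence = reinflect(normalized_output)
```

Separating the steps means the translation model deals with a smaller, less sparse target vocabulary, while the choice of inflected forms is left to a dedicated component.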
This paper describes LIMSI’s submission to the MT track of IWSLT 2016. We report results for translation from English into Czech. Our submission is an attempt to address the difficulties of translating into a morphologically rich language by paying special attention to morphology generation on the target side. To this end, we propose two ways of improving the morphological fluency of the output: 1. by performing translation and inflection of the target language in two separate steps, and 2. by using a neural language model with character-based word representation. We finally present the combination of both methods used for our primary system submission.

2015