Marion Potet

2012

Towards a better understanding of statistical post-editing
Marion Potet | Laurent Besacier | Hervé Blanchon | Marwen Azouzi
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers

We describe several experiments to better understand the usefulness of statistical post-editing (SPE) for improving the raw outputs of phrase-based statistical MT (PBMT) systems. Whatever the size of the training corpus, we show that SPE systems trained on general-domain data offer no breakthrough over our baseline general-domain PBMT system. However, using manually post-edited system outputs to train the SPE led to a slight improvement in translation quality compared with the use of professional reference translations. We also show that SPE is far more effective for domain adaptation, mainly because it recovers many specific terms unknown to our general PBMT system. Finally, we compare two domain adaptation approaches, post-editing a general-domain PBMT system versus building a new domain-adapted PBMT system with two different techniques, and show that the latter outperforms the former. Yet, when the PBMT is a “black box”, SPE trained with post-edited system outputs remains an interesting option for domain adaptation.


Collection of a Large Database of French-English SMT Output Corrections
Marion Potet | Emmanuelle Esperança-Rodier | Laurent Besacier | Hervé Blanchon
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Corpus-based approaches to machine translation (MT) rely on the availability of parallel corpora. To produce user-acceptable translation outputs, such systems need high-quality data to be efficiently trained, optimized and evaluated. However, building a high-quality dataset is a relatively expensive task. In this paper, we describe the collection and analysis of a large database of 10,881 manually corrected SMT output hypotheses. These post-editions were collected using Amazon's Mechanical Turk, following some ethical guidelines. A complete analysis of the collected data pointed to a high quality of the corrections: more than 87 % of the collected post-editions improve the hypotheses, and more than 94 % of the crowdsourced post-editions are at least of professional quality. We also post-edited 1,500 gold-standard reference translations (from bilingual parallel corpora generated by professionals) and noticed that 72 % of these translations needed to be corrected during post-edition. We computed a proximity measure between the different kinds of translations and found that reference translations are as far from the hypotheses as from the corrected hypotheses (i.e. the post-editions). In light of these last findings, we discuss the adequacy of text-based generated reference translations for training sentence-to-sentence based SMT systems.

Given the growth of the Web and the spread of multilingual documents, the need for “on-the-fly” translation has become evident. This article presents a system that proposes, for a given sentence, not a single translation but a list of N translation hypotheses, by calling on several pre-existing translation engines. Nine free machine translation engines available on the Web were selected to submit a text for translation and retrieve its translation. The resulting translations are ranked according to a metric based on a language model. The experiments conducted showed that this translation meta-engine proves more relevant than using a single translation system.
