Guillaume Wisniewski


2021

pdf bib
Biais de genre dans un système de traduction automatiqueneuronale : une étude préliminaire (Gender Bias in Neural Translation : a preliminary study )
Guillaume Wisniewski | Lichao Zhou | Nicolas Ballier | François Yvon
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Cet article présente les premiers résultats d’une étude en cours sur les biais de genre dans les corpus d’entraînements et dans les systèmes de traduction neuronale. Nous étudions en particulier un corpus minimal et contrôlé pour mesurer l’intensité de ces biais dans les deux directions anglais-français et français-anglais ; ce cadre contrôlé nous permet également d’analyser les représentations internes manipulées par le système pour réaliser ses prédictions lexicales, ainsi que de formuler des hypothèses sur la manière dont ce biais se distribue dans les représentations du système.

pdf bib
Are Neural Networks Extracting Linguistic Properties or Memorizing Training Data? An Observation with a Multilingual Probe for Predicting Tense
Bingzhi Li | Guillaume Wisniewski
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We evaluate the ability of Bert embeddings to represent tense information, taking French and Chinese as a case study. In French, the tense information is expressed by verb morphology and can be captured by simple surface information. On the contrary, tense interpretation in Chinese is driven by abstract, lexical, syntactic and even pragmatic information. We show that while French tenses can easily be predicted from sentence representations, results drop sharply for Chinese, which suggests that Bert is more likely to memorize shallow patterns from the training data rather than uncover abstract properties.

2020

pdf bib
Analyse d’erreurs de transcriptions phonémiques automatiques d’une langue « rare » : le na (mosuo) (Analyzing errors in automatic phonemic transcriptions of the Na (Mosuo) language (SinoTibetan family) Automatic phonemic transcription tools now reach high levels of accuracy on a single speaker with relatively small amounts of training data: on the order two to three hours of transcribed speech)
Alexis Michaud | Oliver Adams | Séverine Guillaume | Guillaume Wisniewski
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole

Les systèmes de reconnaissance automatique de la parole atteignent désormais des degrés de précision élevés sur la base d’un corpus d’entraînement limité à deux ou trois heures d’enregistrements transcrits (pour un système mono-locuteur). Au-delà de l’intérêt pratique que présentent ces avancées technologiques pour les tâches de documentation de langues rares et en danger, se pose la question de leur apport pour la réflexion du phonéticien/phonologue. En effet, le modèle acoustique prend en entrée des transcriptions qui reposent sur un ensemble d’hypothèses plus ou moins explicites. Le modèle acoustique, décalqué (par des méthodes statistiques) de l’écrit du linguiste, peut-il être interrogé par ce dernier, en un jeu de miroir ? Notre étude s’appuie sur des exemples d’une langue « rare » de la famille sino-tibétaine, le na (mosuo), pour illustrer la façon dont l’analyse d’erreurs permet une confrontation renouvelée avec le signal acoustique.

pdf bib
Phonemic Transcription of Low-Resource Languages: To What Extent can Preprocessing be Automated?
Guillaume Wisniewski | Séverine Guillaume | Alexis Michaud
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Automatic Speech Recognition for low-resource languages has been an active field of research for more than a decade. It holds promise for facilitating the urgent task of documenting the world’s dwindling linguistic diversity. Various methodological hurdles are encountered in the course of this exciting development, however. A well-identified difficulty is that data preprocessing is not at all trivial: data collected in classical fieldwork are usually tailored to the needs of the linguist who collects them, and there is baffling diversity in formats and annotation schema, even among fieldworkers who use the same software package (such as ELAN). The tests reported here (on Yongning Na and other languages from the Pangloss Collection, an open archive of endangered languages) explore some possibilities for automating the process of data preprocessing: assessing to what extent it is possible to bypass the involvement of language experts for menial tasks of data preparation for Natural Language Processing (NLP) purposes. What is at stake is the accessibility of language archive data for a range of NLP tasks and beyond.

2019

pdf bib
Phonetic Normalization for Machine Translation of User Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We present an approach to correct noisy User Generated Content (UGC) in French aiming to produce a pretreatement pipeline to improve Machine Translation for this kind of non-canonical corpora. In order to do so, we have implemented a character-based neural model phonetizer to produce IPA pronunciations of words. In this way, we intend to correct grammar, vocabulary and accentuation errors often present in noisy UGC corpora. Our method leverages on the fact that some errors are due to confusion induced by words with similar pronunciation which can be corrected using a phonetic look-up table to produce normalization candidates. These potential corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compare to using other phonetizers, our method boosts a transformer-based machine translation system on UGC.

pdf bib
How Bad are PoS Tagger in Cross-Corpora Settings? Evaluating Annotation Divergence in the UD Project.
Guillaume Wisniewski | François Yvon
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The performance of Part-of-Speech tagging varies significantly across the treebanks of the Universal Dependencies project. This work points out that these variations may result from divergences between the annotation of train and test sets. We show how the annotation variation principle, introduced by Dickinson and Meurers (2003) to automatically detect errors in gold standard, can be used to identify inconsistencies between annotations; we also evaluate their impact on prediction performance.

pdf bib
Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This work compares the performances achieved by Phrase-Based Statistical Machine Translation systems (PB-SMT) and attention-based Neuronal Machine Translation systems (NMT) when translating User Generated Content (UGC), as encountered in social medias, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenue for improving NMT models.

pdf bib
Combien d’exemples de tests sont-ils nécessaires à une évaluation fiable ? Quelques observations sur l’évaluation de l’analyse morphosyntaxique du français. (Some observations on the evaluation of PoS taggers)
Guillaume Wisniewski
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

L’objectif de ce travail est de présenter plusieurs observations, sur l’évaluation des analyseurs morphosyntaxique en français, visant à remettre en cause le cadre habituel de l’apprentissage statistique dans lequel les ensembles de test et d’apprentissage sont fixés arbitrairement et indépendemment du modèle considéré. Nous montrons qu’il est possible de considérer des ensembles de test plus petits que ceux généralement utilisés sans conséquences sur la qualité de l’évaluation. Les exemples ainsi « économisés » peuvent être utilisés en apprentissage pour améliorer les performances des systèmes notamment dans des tâches d’adaptation au domaine.

2018

pdf bib
Quantifying training challenges of dependency parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 27th International Conference on Computational Linguistics

Not all dependencies are equal when training a dependency parser: some are straightforward enough to be learned with only a sample of data, others embed more complexity. This work introduces a series of metrics to quantify those differences, and thereby to expose the shortcomings of various parsing algorithms and strategies. Apart from a more thorough comparison of parsing systems, these new tools also prove useful for characterizing the information conveyed by cross-lingual parsers, in a quantitative but still interpretable way.

pdf bib
Automatically Selecting the Best Dependency Annotation Design with Dynamic Oracles
Guillaume Wisniewski | Ophélie Lacroix | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This work introduces a new strategy to compare the numerous conventions that have been proposed over the years for expressing dependency structures and discover the one for which a parser will achieve the highest parsing performance. Instead of associating each sentence in the training set with a single gold reference we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select among all alternatives the reference that will be predicted with the highest accuracy. Experiments on the UD corpora show the validity of this approach.

pdf bib
Exploiting Dynamic Oracles to Train Projective Dependency Parsers on Non-Projective Trees
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Because the most common transition systems are projective, training a transition-based dependency parser often implies to either ignore or rewrite the non-projective training examples, which has an adverse impact on accuracy. In this work, we propose a simple modification of dynamic oracles, which enables the use of non-projective data when training projective parsers. Evaluation on 73 treebanks shows that our method achieves significant gains (+2 to +7 UAS for the most non-projective languages) and consistently outperforms traditional projectivization and pseudo-projectivization approaches.

pdf bib
Automated Paraphrase Lattice Creation for HyTER Machine Translation Evaluation
Marianna Apidianaki | Guillaume Wisniewski | Anne Cocos | Chris Callison-Burch
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a variant of a well-known machine translation (MT) evaluation metric, HyTER (Dreyer and Marcu, 2012), which exploits reference translations enriched with meaning equivalent expressions. The original HyTER metric relied on hand-crafted paraphrase networks which restricted its applicability to new data. We test, for the first time, HyTER with automatically built paraphrase lattices. We show that although the metric obtains good results on small and carefully curated data with both manually and automatically selected substitutes, it achieves medium performance on much larger and noisier datasets, demonstrating the limits of the metric for tuning and evaluation of current MT systems.

pdf bib
Errator: a Tool to Help Detect Annotation Errors in the Universal Dependencies Project
Guillaume Wisniewski
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Analyse morpho-syntaxique en présence d’alternance codique (PoS tagging of Code Switching)
José Carlos Rosales Núñez | Guillaume Wisniewski
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

L’alternance codique est le phénomène qui consiste à alterner les langues au cours d’une même conversation ou d’une même phrase. Avec l’augmentation du volume généré par les utilisateurs, ce phénomène essentiellement oral, se retrouve de plus en plus dans les textes écrits, nécessitant d’adapter les tâches et modèles de traitement automatique de la langue à ce nouveau type d’énoncés. Ce travail présente la collecte et l’annotation en partie du discours d’un corpus d’énoncés comportant des alternances codiques et évalue leur impact sur la tâche d’analyse morpho-syntaxique.

pdf bib
Divergences entre annotations dans le projet Universal Dependencies et leur impact sur l’évaluation des performance d’étiquetage morpho-syntaxique (Evaluating Annotation Divergences in the UD Project)
Guillaume Wisniewski | François Yvon
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Ce travail montre que la dégradation des performances souvent observée lors de l’application d’un analyseur morpho-syntaxique à des données hors domaine résulte souvent d’incohérences entre les annotations des ensembles de test et d’apprentissage. Nous montrons comment le principe de variation des annotations, introduit par Dickinson & Meurers (2003) pour identifier automatiquement les erreurs d’annotation, peut être utilisé pour identifier ces incohérences et évaluer leur impact sur les performances des analyseurs morpho-syntaxiques.

2017

pdf bib
A Systematic Comparison of Syntactic Representations of Dependency Parsing
Guillaume Wisniewski | Ophélie Lacroix
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

pdf bib
LIMSI Submission for WMT’17 Shared Task on Bandit Learning
Guillaume Wisniewski
Proceedings of the Second Conference on Machine Translation

pdf bib
LIMSI@CoNLL’17: UD Shared Task
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes LIMSI’s submission to the CoNLL 2017 UD Shared Task, which is focused on small treebanks, and how to improve low-resourced parsing only by ad hoc combination of multiple views and resources. We present our approach for low-resourced parsing, together with a detailed analysis of the results for each test treebank. We also report extensive analysis experiments on model selection for the PUD treebanks, and on annotation consistency among UD treebanks.

pdf bib
Adaptation au domaine pour l’analyse morpho-syntaxique (Domain Adaptation for PoS tagging)
Éléonor Bartenlian | Margot Lacour | Matthieu Labeau | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Ce travail cherche à comprendre pourquoi les performances d’un analyseur morpho-syntaxiques chutent fortement lorsque celui-ci est utilisé sur des données hors domaine. Nous montrons à l’aide d’une expérience jouet que ce comportement peut être dû à un phénomène de masquage des caractéristiques lexicalisées par les caractéristiques non lexicalisées. Nous proposons plusieurs modèles essayant de réduire cet effet.

pdf bib
Don’t Stop Me Now! Using Global Dynamic Oracles to Correct Training Biases of Transition-Based Dependency Parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper formalizes a sound extension of dynamic oracles to global training, in the frame of transition-based dependency parsers. By dispensing with the pre-computation of references, this extension widens the training strategies that can be entertained for such parsers; we show this by revisiting two standard training procedures, early-update and max-violation, to correct some of their search space sampling biases. Experimentally, on the SPMRL treebanks, this improvement increases the similarity between the train and test distributions and yields performance improvements up to 0.7 UAS, without any computation overhead.

2016

pdf bib
Apprentissage d’analyseur en dépendances cross-lingue par projection partielle de dépendances (Cross-lingual learning of dependency parsers from partially projected dependencies )
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Cet article présente une méthode simple de transfert cross-lingue de dépendances. Nous montrons tout d’abord qu’il est possible d’apprendre un analyseur en dépendances par transition à partir de données partiellement annotées. Nous proposons ensuite de construire de grands ensembles de données partiellement annotés pour plusieurs langues cibles en projetant les dépendances via les liens d’alignement les plus sûrs. En apprenant des analyseurs pour les langues cibles à partir de ces données partielles, nous montrons que cette méthode simple obtient des performances qui rivalisent avec celles de méthodes état-de-l’art récentes, tout en ayant un coût algorithmique moindre.

pdf bib
Ne nous arrêtons pas en si bon chemin : améliorations de l’apprentissage global d’analyseurs en dépendances par transition (Don’t Stop Me Now ! Improved Update Strategies for Global Training of Transition-Based)
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Dans cet article, nous proposons trois améliorations simples pour l’apprentissage global d’analyseurs en dépendances par transition de type A RC E AGER : un oracle non déterministe, la reprise sur le même exemple après une mise à jour et l’entraînement en configurations sous-optimales. Leur combinaison apporte un gain moyen de 0,2 UAS sur le corpus SPMRL. Nous introduisons également un cadre général permettant la comparaison systématique de ces stratégies et de la plupart des variantes connues. Nous montrons que la littérature n’a étudié que quelques stratégies parmi les nombreuses variations possibles, négligeant ainsi plusieurs pistes d’améliorations potentielles.

pdf bib
Investigating gender adaptation for speech translation
Rachel Bawden | Guillaume Wisniewski | Hélène Maynard
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In this paper we investigate the impact of the integration of context into dialogue translation. We present a new contextual parallel corpus of television subtitles and show how taking into account speaker gender can significantly improve machine translation quality in terms of B LEU and M ETEOR scores. We perform a manual analysis, which suggests that these improvements are not necessary related to the morphological consequences of speaker gender, but to more general linguistic divergences.

pdf bib
Cross-lingual and Supervised Models for Morphosyntactic Annotation: a Comparison on Romanian
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Because of the small size of Romanian corpora, the performance of a PoS tagger or a dependency parser trained with the standard supervised methods fall far short from the performance achieved in most languages. That is why, we apply state-of-the-art methods for cross-lingual transfer on Romanian tagging and parsing, from English and several Romance languages. We compare the performance with monolingual systems trained with sets of different sizes and establish that training on a few sentences in target language yields better results than transferring from large datasets in other languages.

pdf bib
Cross-lingual Dependency Transfer : What Matters? Assessing the Impact of Pre- and Post-processing
Ophélie Lacroix | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
Cross-lingual alignment transfer: a chicken-and-egg story?
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
LIMSI@WMT’16: Machine Translation of News
Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Ophélie Lacroix | Elena Knyazeva | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Frustratingly Easy Cross-Lingual Transfer for Transition-Based Dependency Parsing
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Zero-resource Dependency Parsing: Boosting Delexicalized Cross-lingual Transfer with Linguistic Knowledge
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper studies cross-lingual transfer for dependency parsing, focusing on very low-resource settings where delexicalized transfer is the only fully automatic option. We show how to boost parsing performance by rewriting the source sentences so as to better match the linguistic regularities of the target language. We contrast a data-driven approach with an approach relying on linguistically motivated rules automatically extracted from the World Atlas of Language Structures. Our findings are backed up by experiments involving 40 languages. They show that both approaches greatly outperform the baseline, the knowledge-driven method yielding the best accuracies, with average improvements of +2.9 UAS, and up to +90 UAS (absolute) on some frequent PoS configurations.

2015

pdf bib
Why Predicting Post-Edition is so Hard? Failure Analysis of LIMSI Submission to the APE Shared Task
Guillaume Wisniewski | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Apprentissage par imitation pour l’étiquetage de séquences : vers une formalisation des méthodes d’étiquetage easy-first
Elena Knyazeva | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

De nombreuses méthodes ont été proposées pour accélérer la prédiction d’objets structurés (tels que les arbres ou les séquences), ou pour permettre la prise en compte de dépendances plus riches afin d’améliorer les performances de la prédiction. Ces méthodes reposent généralement sur des techniques d’inférence approchée et ne bénéficient d’aucune garantie théorique aussi bien du point de vue de la qualité de la solution trouvée que du point de vue de leur critère d’apprentissage. Dans ce travail, nous étudions une nouvelle formulation de l’apprentissage structuré qui consiste à voir celui-ci comme un processus incrémental au cours duquel la sortie est construite de façon progressive. Ce cadre permet de formaliser plusieurs approches de prédiction structurée existantes. Grâce au lien que nous faisons entre apprentissage structuré et apprentissage par renforcement, nous sommes en mesure de proposer une méthode théoriquement bien justifiée pour apprendre des méthodes d’inférence approchée. Les expériences que nous réalisons sur quatre tâches de TAL valident l’approche proposée.

pdf bib
Oublier ce qu’on sait, pour mieux apprendre ce qu’on ne sait pas : une étude sur les contraintes de type dans les modèles CRF
Nicolas Pécheux | Alexandre Allauzen | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Quand on dispose de connaissances a priori sur les sorties possibles d’un problème d’étiquetage, il semble souhaitable d’inclure cette information lors de l’apprentissage pour simplifier la tâche de modélisation et accélérer les traitements. Pourtant, même lorsque ces contraintes sont correctes et utiles au décodage, leur utilisation lors de l’apprentissage peut dégrader sévèrement les performances. Dans cet article, nous étudions ce paradoxe et montrons que le manque de contraste induit par les connaissances entraîne une forme de sous-apprentissage qu’il est cependant possible de limiter.

2014

pdf bib
LIMSI Submission for WMT’14 QE Task
Guillaume Wisniewski | Nicolas Pécheux | Alexander Allauzen | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
A Corpus of Machine Translation Errors Extracted from Translation Students Exercises
Guillaume Wisniewski | Natalie Kübler | François Yvon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present a freely available corpus of automatic translations accompanied with post-edited versions, annotated with labels identifying the different kinds of errors made by the MT system. These data have been extracted from translation students exercises that have been corrected by a senior professor. This corpus can be useful for training quality estimation tools and for analyzing the types of errors made MT system.

pdf bib
Cross-Lingual POS Tagging through Ambiguous Learning: First Experiments (Apprentissage partiellement supervisé d’un étiqueteur morpho-syntaxique par transfert cross-lingue) [in French]
Guillaume Wisniewski | Nicolas Pécheux | Elena Knyazeva | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning
Guillaume Wisniewski | Nicolas Pécheux | Souhir Gahbiche-Braham | François Yvon
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
LIMSI Submission for the WMT‘13 Quality Estimation Task: an Experiment with N-Gram Posteriors
Anil Kumar Singh | Guillaume Wisniewski | François Yvon
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
Guillaume Wisniewski
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Design and Analysis of a Large Corpus of Post-Edited Translations: Quality Estimation, Failure Analysis and the Variability of Post-Edition
Guillaume Wisniewski | Anil Kumar Singh | Natalia Segal | François Yvon
Proceedings of Machine Translation Summit XIV: Papers

pdf bib
A corpus of post-edited translations (Un corpus d’erreurs de traduction) [in French]
Guillaume Wisniewski | Anil Kumar Singh | Natalia Segal | François Yvon
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

pdf bib
Non-Linear Models for Confidence Estimation
Yong Zhuang | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
LIMSI @ WMT12
Hai-Son Le | Thomas Lavergne | Alexandre Allauzen | Marianna Apidianaki | Li Gong | Aurélien Max | Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
WSD for n-best reranking and local language modeling in SMT
Marianna Apidianaki | Guillaume Wisniewski | Artem Sokolov | Aurélien Max | François Yvon
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Non-linear n-best List Reranking with Few Features
Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In Machine Translation, it is customary to compute the model score of a predicted hypothesis as a linear combination of multiple features, where each feature assesses a particular facet of the hypothesis. The choice of a linear combination is usually justified by the possibility of efficient inference (decoding); yet, the appropriateness of this simple combination scheme to the task at hand is rarely questioned. In this paper, we propose an approach that replaces the linear scoring function with a non-linear scoring function. To investigate the applicability of this approach, we rescore n-best lists generated with a conventional machine translation engine (using a linear scoring function for generating its hypotheses) with a non-linear scoring function learned using the learning-to-rank framework. Moderate, though consistent, gains in BLEU are demonstrated on the WMT’10, WMT’11 and WMT’12 test sets.

pdf bib
Computing Lattice BLEU Oracle Scores for Machine Translation
Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

pdf bib
LIMSI @ WMT11
Alexandre Allauzen | Hélène Bonneau-Maynard | Hai-Son Le | Aurélien Max | Guillaume Wisniewski | François Yvon | Gilles Adda | Josep Maria Crego | Adrien Lardilleux | Thomas Lavergne | Artem Sokolov
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf bib
LIMSI @ IWSLT 2010
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | Le Hai-Son | Guillaume Wisniewski | François Yvon
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes LIMSI’s Statistical Machine Translation systems (SMT) for the IWSLT evaluation, where we participated in two tasks (Talk for English to French and BTEC for Turkish to English). For the Talk task, we studied an extension of our in-house n-code SMT system (the integration of a bilingual reordering model over generalized translation units), as well as the use of training data extracted from Wikipedia in order to adapt the target language model. For the BTEC task, we concentrated on pre-processing schemes on the Turkish side in order to reduce the morphological discrepancies with the English side. We also evaluated the use of two different continuous space language models for such a small size of training data.

pdf bib
Refining Word Alignment with Discriminative Training
Nadi Tomeh | Alexandre Allauzen | François Yvon | Guillaume Wisniewski
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

The quality of statistical machine translation systems depends on the quality of the word alignments that are computed during the translation model training phase. IBM alignment models, as implemented in the GIZA++ toolkit, constitute the de facto standard for performing these computations. The resulting alignments and translation models are however very noisy, and several authors have tried to improve them. In this work, we propose a simple and effective approach, which considers alignment as a series of independent binary classification problems in the alignment matrix. Through extensive feature engineering and the use of stacking techniques, we were able to obtain alignments much closer to manually defined references than those obtained by the IBM models. These alignments also yield better translation models, delivering improved performance in a large scale Arabic to English translation task.

pdf bib
Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History
Aurélien Max | Guillaume Wisniewski
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Naturally-occurring instances of linguistic phenomena are important both for training and for evaluating automatic text processing. When available in large quantities, they also prove interesting material for linguistic studies. In this article, we present WiCoPaCo (Wikipedia Correction and Paraphrase Corpus), a new freely-available resource built by automatically mining Wikipedia’s revision history. The WiCoPaCo corpus focuses on local modifications made by human revisors and include various types of corrections (such as spelling error or typographical corrections) and rewritings, which can be categorized broadly into meaning-preserving and meaning-altering revisions. We present an initial hand-built typology of these revisions, but the resource allows for any possible annotation scheme. We discuss the main motivations for building such a resource and describe the main technical details guiding its construction. We also present applications and data analysis on French and report initial results on spelling error correction and morphosyntactic rewriting. The WiCoPaCo corpus can be freely downloaded from http://wicopaco.limsi.fr.

pdf bib
Training Continuous Space Language Models: Some Practical Issues
Hai Son Le | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Assessing Phrase-Based Translation Models with Oracle Decoding
Guillaume Wisniewski | Alexandre Allauzen | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Recueil et analyse d’un corpus écologique de corrections orthographiques extrait des révisions de Wikipédia
Guillaume Wisniewski | Aurélien Max | François Yvon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Dans cet article, nous introduisons une méthode à base de règles permettant d’extraire automatiquement de l’historique des éditions de l’encyclopédie collaborative Wikipédia des corrections orthographiques. Cette méthode nous a permis de construire un corpus d’erreurs composé de 72 483 erreurs lexicales (non-word errors) et 74 100 erreurs grammaticales (real-word errors). Il n’existe pas, à notre connaissance, de plus gros corpus d’erreurs écologiques librement disponible. En outre, les techniques mises en oeuvre peuvent être facilement transposées à de nombreuses autres langues. La collecte de ce corpus ouvre de nouvelles perspectives pour l’étude des erreurs fréquentes ainsi que l’apprentissage et l’évaluation des correcteurs orthographiques automatiques. Plusieurs expériences illustrant son intérêt sont proposées.