Guillaume Wisniewski


2024

pdf bib
Establishing degrees of closeness between audio recordings along different dimensions using large-scale cross-lingual models
Maxime Fily | Guillaume Wisniewski | Severine Guillaume | Gilles Adda | Alexis Michaud
Findings of the Association for Computational Linguistics: EACL 2024

In the highly constrained context of low-resource language studies, we explore vector representations of speech from a pretrained model to determine their level of abstraction with regard to the audio signal. We propose a new unsupervised method using ABX tests on audio recordings with carefully curated metadata to shed light on the type of information present in the representations. ABX tests determine whether the representations computed by a multilingual speech model encode a given characteristic. Three experiments are devised: one on room acoustics aspects, one on linguistic genre, and one on phonetic aspects. The results confirm that the representations extracted from recordings with different linguistic/extra-linguistic characteristics differ along the same lines. Embedding more audio signal in one vector better discriminates extra-linguistic characteristics, whereas shorter snippets are better to distinguish segmental information. The method is fully unsupervised, potentially opening new research avenues for comparative work on under-documented languages.

pdf bib
Mesure du niveau de proximité entre enregistrements audio et évaluation indirecte du niveau d’abstraction des représentations issues d’un grand modèle de langage
Maxime Fily | Guillaume Wisniewski | Séverine Guillaume | Gilles Adda | Alexis Michaud
Actes des 35èmes Journées d'Études sur la Parole

Nous explorons les représentations vectorielles de la parole à partir d’un modèle pré-entraîné pour déterminer leur niveau d’abstraction par rapport au signal audio. Nous proposons une nouvelle méthode non-supervisée exploitant des données audio ayant des métadonnées soigneusement organisées pour apporter un éclairage sur les informations présentes dans les représentations. Des tests ABX déterminent si les représentations obtenues via un modèle de parole multilingue encodent une caractéristique donnée. Trois expériences sont présentées, portant sur la qualité acoustique de la pièce, le type de discours, ou le contenu phonétique. Les résultats confirment que les différences au niveau de caractéristiques linguistiques/extra-linguistiques d’enregistrements audio sont reflétées dans les représentations de ceux-ci. Plus la quantité d’audio par vecteur est importante, mieux elle permet de distinguer les caractéristiques extra-linguistiques. Plus elle est faible, et mieux nous pouvons distinguer les informations d’ordre phonétique/segmental. La méthode proposée ouvre de nouvelles pistes pour la recherche et les travaux comparatifs sur les langues peu dotées.

pdf bib
Évaluation de la Similarité Textuelle : Entre Sémantique et Surface dans les Représentations Neuronales
Julie Tytgat | Guillaume Wisniewski | Adrien Betrancourt
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

La mesure de la similarité entre textes, qu’elle soit basée sur le sens, les caractères ou la phonétique, est essentielle dans de nombreuses applications. Les réseaux neuronaux, en transformant le texte en vecteurs, offrent une méthode pratique pour évaluer cette similarité. Cependant, l’utilisation de ces représentations pose un défi car les critères sous-jacents à cette similarité ne sont pas clairement définis, oscillant entre sémantique et surface. Notre étude, basée sur des expériences contrôlées, révèle que les différences de surface ont un impact plus significatif que les différences de sémantique sur les mesures de similarité entre les représentations neuronales des mots construites par de nombreux modèles pré-entrainés. Ces résultats soulèvent des questions sur la nature même de la similarité mesurée par les modèles neuronaux et leurs capacités à capturer les nuances sémantiques.

2023

pdf bib
Using Artificial French Data to Understand the Emergence of Gender Bias in Transformer Language Models
Lina Conti | Guillaume Wisniewski
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Numerous studies have demonstrated the ability of neural language models to learn various linguistic properties without direct supervision. This work takes an initial step towards exploring the less researched topic of how neural models discover linguistic properties of words, such as gender, as well as the rules governing their usage. We propose to use an artificial corpus generated by a PCFG based on French to precisely control the gender distribution in the training data and determine under which conditions a model correctly captures gender information or, on the contrary, appears gender-biased.

pdf bib
Fine-tuning MBART-50 with French and Farsi data to improve the translation of Farsi dislocations into English and French
Behnoosh Namdarzadeh | Sadaf Mohseni | Lichao Zhu | Guillaume Wisniewski | Nicolas Ballier
Proceedings of Machine Translation Summit XIX, Vol. 2: Users Track

In this paper, we discuss the improvements brought by the fine-tuning of mBART50 for the translation of a specific Farsi dataset of dislocations. Given our BLEU scores, our evaluation is mostly qualitative: we assess the improvements of our fine-tuning in the translations into French of our test dataset of Farsi. We describe the fine-tuning procedure and discuss the quality of the results in the translations from Farsi. We assess the sentences in the French translations that contain English tokens and for the English translations, we examine the ability of the fine- tuned system to translate Farsi dislocations into English without replicating the dislocated item as a double subject. We scrutinized the Farsi training data used to train for mBART50 (Tang et al., 2021). We fine-tuned mBART50 with samples from an in-house French-Farsi aligned translation of a short story. In spite of the scarcity of available resources, we found that fine- tuning with aligned French-Farsi data dramatically improved the grammatical well-formedness of the predictions for French, even if serious semantic issues remained. We replicated the experiment with the English translation of the same Farsi short story for a Farsi-English fine-tuning and found out that similar semantic inadequacies cropped up, and that some translations were worse than our mBART50 baseline. We showcased the fine-tuning of mBART50 with supplementary data and discussed the asymmetry of the situation, adding little data in the fine-tuning is sufficient to improve morpho-syntax for one language pair but seems to degrade translation to English.

pdf bib
Multi-way Variational NMT for UGC: Improving Robustness in Zero-shot Scenarios via Mixture Density Networks
José Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

This work presents a novel Variational Neural Machine Translation (VNMT) architecture with enhanced robustness properties, which we investigate through a detailed case-study addressing noisy French user-generated content (UGC) translation to English. We show that the proposed model, with results comparable or superior to state-of-the-art VNMT, improves performance over UGC translation in a zero-shot evaluation scenario while keeping optimal translation scores on in-domain test sets. We elaborate on such results by visualizing and explaining how neural learning representations behave when processing UGC noise. In addition, we show that VNMT enforces robustness to the learned embeddings, which can be later used for robust transfer learning approaches.

pdf bib
Assessing the Capacity of Transformer to Abstract Syntactic Representations: A Contrastive Analysis Based on Long-distance Agreement
Bingzhi Li | Guillaume Wisniewski | Benoît Crabbé
Transactions of the Association for Computational Linguistics, Volume 11

Many studies have shown that transformers are able to predict subject-verb agreement, demonstrating their ability to uncover an abstract representation of the sentence in an unsupervised way. Recently, Li et al. (2021) found that transformers were also able to predict the object-past participle agreement in French, the modeling of which in formal grammar is fundamentally different from that of subject-verb agreement and relies on a movement and an anaphora resolution. To better understand transformers’ internal working, we propose to contrast how they handle these two kinds of agreement. Using probing and counterfactual analysis methods, our experiments on French agreements show that (i) the agreement task suffers from several confounders that partially question the conclusions drawn so far and (ii) transformers handle subject-verb and object-past participle agreements in a way that is consistent with their modeling in theoretical linguistics.

pdf bib
Investigating Techniques for a Deeper Understanding of Neural Machine Translation (NMT) Systems through Data Filtering and Fine-tuning Strategies
Lichao Zhu | Maria Zimina | Maud Bénard | Behnoosh Namdar | Nicolas Ballier | Guillaume Wisniewski | Jean-Baptiste Yunès
Proceedings of the Eighth Conference on Machine Translation

In the context of this biomedical shared task, we have implemented data filters to enhance the selection of relevant training data for fine- tuning from the available training data sources. Specifically, we have employed textometric analysis to detect repetitive segments within the test set, which we have then used for re- fining the training data used to fine-tune the mBart-50 baseline model. Through this approach, we aim to achieve several objectives: developing a practical fine-tuning strategy for training biomedical in-domain fr<>en models, defining criteria for filtering in-domain training data, and comparing model predictions, fine-tuning data in accordance with the test set to gain a deeper insight into the functioning of Neural Machine Translation (NMT) systems.

2022

pdf bib
How Distributed are Distributed Representations? An Observation on the Locality of Syntactic Information in Verb Agreement Tasks
Bingzhi Li | Guillaume Wisniewski | Benoit Crabbé
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This work addresses the question of the localization of syntactic information encoded in the transformers representations. We tackle this question from two perspectives, considering the object-past participle agreement in French, by identifying, first, in which part of the sentence and, second, in which part of the representation the syntactic information is encoded. The results of our experiments, using probing, causal analysis and feature selection method, show that syntactic information is encoded locally in a way consistent with the French grammar.

pdf bib
Analyzing Gender Translation Errors to Identify Information Flows between the Encoder and Decoder of a NMT System
Guillaume Wisniewski | Lichao Zhu | Nicolas Ballier | François Yvon
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Multiple studies have shown that existing NMT systems demonstrate some kind of “gender bias”. As a result, MT output appears to err more often for feminine forms and to amplify social gender misrepresentations, which is potentially harmful to users and practioners of these technologies. This paper continues this line of investigations and reports results obtained with a new test set in strictly controlled conditions. This setting allows us to better understand the multiple inner mechanisms that are causing these biases, which include the linguistic expressions of gender, the unbalanced distribution of masculine and feminine forms in the language, the modelling of morphological variation and the training process dynamics. To counterbalance these effects, we formulate several proposals and notably show that modifying the training loss can effectively mitigate such biases.

pdf bib
Fine-tuning pre-trained models for Automatic Speech Recognition, experiments on a fieldwork corpus of Japhug (Trans-Himalayan family)
Séverine Guillaume | Guillaume Wisniewski | Cécile Macaire | Guillaume Jacques | Alexis Michaud | Benjamin Galliot | Maximin Coavoux | Solange Rossato | Minh-Châu Nguyên | Maxime Fily
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

This is a report on results obtained in the development of speech recognition tools intended to support linguistic documentation efforts. The test case is an extensive fieldwork corpus of Japhug, an endangered language of the Trans-Himalayan (Sino-Tibetan) family. The goal is to reduce the transcription workload of field linguists. The method used is a deep learning approach based on the language-specific tuning of a generic pre-trained representation model, XLS-R, using a Transformer architecture. We note difficulties in implementation, in terms of learning stability. But this approach brings significant improvements nonetheless. The quality of phonemic transcription is improved over earlier experiments; and most significantly, the new approach allows for reaching the stage of automatic word recognition. Subjective evaluation of the tool by the author of the training data confirms the usefulness of this approach.

pdf bib
Les représentations distribuées sont-elles vraiment distribuées ? Observations sur la localisation de l’information syntaxique dans les tâches d’accord du verbe en français (How Distributed are Distributed Representations ? An Observation on the Locality of Syntactic)
Bingzhi Li | Guillaume Wisniewski | Benoît Crabbé
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Ce travail aborde la question de la localisation de l’information syntaxique qui est encodée dans les représentations de transformers. En considérant la tâche d’accord objet-participe passé en français, les résultats de nos sondes linguistiques montrent que les informations nécessaires pour accomplir la tâche sont encodées d’une manière locale dans les représentations de mots entre l’antécédent du pronom relatif objet et le participe passé cible. En plus, notre analyse causale montre que le modèle s’appuie essentiellement sur les éléments linguistiquement motivés (i.e. antécédent et pronom relatif) pour prédire le nombre du participe passé.

pdf bib
Flux d’informations dans les systèmes encodeur-décodeur. Application à l’explication des biais de genre dans les systèmes de traduction automatique. (Information flow in encoder-decoder systems applied to the explanation of gender bias in machine translation systems)
Lichao Zhu | Guillaume Wisniewski | Nicolas Ballier | François Yvon
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Atelier TAL et Humanités Numériques (TAL-HN)

Ce travail présente deux séries d’expériences visant à identifier les flux d’information dans les systèmes de traduction neuronaux. La première série s’appuie sur une comparaison des décisions d’un modèle de langue et d’un modèle de traduction pour mettre en évidence le flux d’information provenant de la source. La seconde série met en évidence l’impact de ces flux sur l’apprentissage du système dans le cas particulier du transfert de l’information de genre.

pdf bib
Toward a Test Set of Dislocations in Persian for Neural Machine Translation
Behnoosh Namdarzadeh | Nicolas Ballier | Lichao Zhu | Guillaume Wisniewski | Jean-Baptiste Yunès
Proceedings of the Third International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2022) co-located with ICNLSP 2022

pdf bib
Biais de genre dans un système de traduction automatique neuronale : une étude des mécanismes de transfert cross-langue [Gender bias in a neural machine translation system: a study of crosslingual transfer mechanisms]
Guillaume Wisniewski | Lichao Zhu | Nicolas Ballier | François Yvon
Traitement Automatique des Langues, Volume 63, Numéro 1 : Varia [Varia]

pdf bib
The SPECTRANS System Description for the WMT22 Biomedical Task
Nicolas Ballier | Jean-baptiste Yunès | Guillaume Wisniewski | Lichao Zhu | Maria Zimina
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the SPECTRANS submission for the WMT 2022 biomedical shared task. We present the results of our experiments using the training corpora and the JoeyNMT (Kreutzer et al., 2019) and SYSTRAN Pure Neural Server/ Advanced Model Studio toolkits for the language directions English to French and French to English. We compare the pre- dictions of the different toolkits. We also use JoeyNMT to fine-tune the model with a selection of texts from WMT, Khresmoi and UFAL data sets. We report our results and assess the respective merits of the different translated texts.

2021

pdf bib
Screening Gender Transfer in Neural Machine Translation
Guillaume Wisniewski | Lichao Zhu | Nicolas Bailler | François Yvon
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

This paper aims at identifying the information flow in state-of-the-art machine translation systems, taking as example the transfer of gender when translating from French into English. Using a controlled set of examples, we experiment several ways to investigate how gender information circulates in a encoder-decoder architecture considering both probing techniques as well as interventions on the internal representations used in the MT system. Our results show that gender information can be found in all token representations built by the encoder and the decoder and lead us to conclude that there are multiple pathways for gender transfer.

pdf bib
User-friendly Automatic Transcription of Low-resource Languages: Plugging ESPnet into Elpis
Oliver Adams | Benjamin Galliot | Guillaume Wisniewski | Nicholas Lambourne | Ben Foley | Rahasya Sanders-Dwyer | Janet Wiles | Alexis Michaud | Séverine Guillaume | Laurent Besacier | Christopher Cox | Katya Aplonova | Guillaume Jacques | Nathan Hill
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf bib
Are Neural Networks Extracting Linguistic Properties or Memorizing Training Data? An Observation with a Multilingual Probe for Predicting Tense
Bingzhi Li | Guillaume Wisniewski
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We evaluate the ability of Bert embeddings to represent tense information, taking French and Chinese as a case study. In French, the tense information is expressed by verb morphology and can be captured by simple surface information. On the contrary, tense interpretation in Chinese is driven by abstract, lexical, syntactic and even pragmatic information. We show that while French tenses can easily be predicted from sentence representations, results drop sharply for Chinese, which suggests that Bert is more likely to memorize shallow patterns from the training data rather than uncover abstract properties.

pdf bib
Are Transformers a Modern Version of ELIZA? Observations on French Object Verb Agreement
Bingzhi Li | Guillaume Wisniewski | Benoit Crabbé
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Many recent works have demonstrated that unsupervised sentence representations of neural networks encode syntactic information by observing that neural language models are able to predict the agreement between a verb and its subject. We take a critical look at this line of research by showing that it is possible to achieve high accuracy on this agreement task with simple surface heuristics, indicating a possible flaw in our assessment of neural networks’ syntactic ability. Our fine-grained analyses of results on the long-range French object-verb agreement show that contrary to LSTMs, Transformers are able to capture a non-trivial amount of grammatical structure.

pdf bib
Biais de genre dans un système de traduction automatiqueneuronale : une étude préliminaire (Gender Bias in Neural Translation : a preliminary study )
Guillaume Wisniewski | Lichao Zhu | Nicolas Ballier | François Yvon
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Cet article présente les premiers résultats d’une étude en cours sur les biais de genre dans les corpus d’entraînements et dans les systèmes de traduction neuronale. Nous étudions en particulier un corpus minimal et contrôlé pour mesurer l’intensité de ces biais dans les deux directions anglais-français et français-anglais ; ce cadre contrôlé nous permet également d’analyser les représentations internes manipulées par le système pour réaliser ses prédictions lexicales, ainsi que de formuler des hypothèses sur la manière dont ce biais se distribue dans les représentations du système.

pdf bib
The SPECTRANS System Description for the WMT21 Terminology Task
Nicolas Ballier | Dahn Cho | Bilal Faye | Zong-You Ke | Hanna Martikainen | Mojca Pecman | Guillaume Wisniewski | Jean-Baptiste Yunès | Lichao Zhu | Maria Zimina-Poirot
Proceedings of the Sixth Conference on Machine Translation

This paper discusses the WMT 2021 terminology shared task from a “meta” perspective. We present the results of our experiments using the terminology dataset and the OpenNMT (Klein et al., 2017) and JoeyNMT (Kreutzer et al., 2019) toolkits for the language direction English to French. Our experiment 1 compares the predictions of the two toolkits. Experiment 2 uses OpenNMT to fine-tune the model. We report our results for the task with the evaluation script but mostly discuss the linguistic properties of the terminology dataset provided for the task. We provide evidence of the importance of text genres across scores, having replicated the evaluation scripts.

pdf bib
Understanding the Impact of UGC Specificities on Translation Quality
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

This work takes a critical look at the evaluation of user-generated content automatic translation, the well-known specificities of which raise many challenges for MT. Our analyses show that measuring the average-case performance using a standard metric on a UGC test set falls far short of giving a reliable image of the UGC translation quality. That is why we introduce a new data set for the evaluation of UGC translation in which UGC specificities have been manually annotated using a fine-grained typology. Using this data set, we conduct several experiments to measure the impact of different kinds of UGC specificities on translation quality, more precisely than previously possible.

pdf bib
Noisy UGC Translation at the Character Level: Revisiting Open-Vocabulary Capabilities and Robustness of Char-Based Models
José Carlos Rosales Núñez | Guillaume Wisniewski | Djamé Seddah
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

This work explores the capacities of character-based Neural Machine Translation to translate noisy User-Generated Content (UGC) with a strong focus on exploring the limits of such approaches to handle productive UGC phenomena, which almost by definition, cannot be seen at training time. Within a strict zero-shot scenario, we first study the detrimental impact on translation performance of various user-generated content phenomena on a small annotated dataset we developed and then show that such models are indeed incapable of handling unknown letters, which leads to catastrophic translation failure once such characters are encountered. We further confirm this behavior with a simple, yet insightful, copy task experiment and highlight the importance of reducing the vocabulary size hyper-parameter to increase the robustness of character-based models for machine translation.

2020

pdf bib
Analyse d’erreurs de transcriptions phonémiques automatiques d’une langue « rare » : le na (mosuo) (Analyzing errors in automatic phonemic transcriptions of the Na (Mosuo) language (SinoTibetan family) Automatic phonemic transcription tools now reach high levels of accuracy on a single speaker with relatively small amounts of training data: on the order two to three hours of transcribed speech)
Alexis Michaud | Oliver Adams | Séverine Guillaume | Guillaume Wisniewski
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole

Les systèmes de reconnaissance automatique de la parole atteignent désormais des degrés de précision élevés sur la base d’un corpus d’entraînement limité à deux ou trois heures d’enregistrements transcrits (pour un système mono-locuteur). Au-delà de l’intérêt pratique que présentent ces avancées technologiques pour les tâches de documentation de langues rares et en danger, se pose la question de leur apport pour la réflexion du phonéticien/phonologue. En effet, le modèle acoustique prend en entrée des transcriptions qui reposent sur un ensemble d’hypothèses plus ou moins explicites. Le modèle acoustique, décalqué (par des méthodes statistiques) de l’écrit du linguiste, peut-il être interrogé par ce dernier, en un jeu de miroir ? Notre étude s’appuie sur des exemples d’une langue « rare » de la famille sino-tibétaine, le na (mosuo), pour illustrer la façon dont l’analyse d’erreurs permet une confrontation renouvelée avec le signal acoustique.

pdf bib
Phonemic Transcription of Low-Resource Languages: To What Extent can Preprocessing be Automated?
Guillaume Wisniewski | Séverine Guillaume | Alexis Michaud
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Automatic Speech Recognition for low-resource languages has been an active field of research for more than a decade. It holds promise for facilitating the urgent task of documenting the world’s dwindling linguistic diversity. Various methodological hurdles are encountered in the course of this exciting development, however. A well-identified difficulty is that data preprocessing is not at all trivial: data collected in classical fieldwork are usually tailored to the needs of the linguist who collects them, and there is baffling diversity in formats and annotation schema, even among fieldworkers who use the same software package (such as ELAN). The tests reported here (on Yongning Na and other languages from the Pangloss Collection, an open archive of endangered languages) explore some possibilities for automating the process of data preprocessing: assessing to what extent it is possible to bypass the involvement of language experts for menial tasks of data preparation for Natural Language Processing (NLP) purposes. What is at stake is the accessibility of language archive data for a range of NLP tasks and beyond.

2019

pdf bib
Combien d’exemples de tests sont-ils nécessaires à une évaluation fiable ? Quelques observations sur l’évaluation de l’analyse morphosyntaxique du français. (Some observations on the evaluation of PoS taggers)
Guillaume Wisniewski
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

L’objectif de ce travail est de présenter plusieurs observations, sur l’évaluation des analyseurs morphosyntaxique en français, visant à remettre en cause le cadre habituel de l’apprentissage statistique dans lequel les ensembles de test et d’apprentissage sont fixés arbitrairement et indépendemment du modèle considéré. Nous montrons qu’il est possible de considérer des ensembles de test plus petits que ceux généralement utilisés sans conséquences sur la qualité de l’évaluation. Les exemples ainsi « économisés » peuvent être utilisés en apprentissage pour améliorer les performances des systèmes notamment dans des tâches d’adaptation au domaine.

pdf bib
Phonetic Normalization for Machine Translation of User Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We present an approach to correct noisy User Generated Content (UGC) in French aiming to produce a pretreatement pipeline to improve Machine Translation for this kind of non-canonical corpora. In order to do so, we have implemented a character-based neural model phonetizer to produce IPA pronunciations of words. In this way, we intend to correct grammar, vocabulary and accentuation errors often present in noisy UGC corpora. Our method leverages on the fact that some errors are due to confusion induced by words with similar pronunciation which can be corrected using a phonetic look-up table to produce normalization candidates. These potential corrections are then encoded in a lattice and ranked using a language model to output the most probable corrected phrase. Compare to using other phonetizers, our method boosts a transformer-based machine translation system on UGC.

pdf bib
How Bad are PoS Tagger in Cross-Corpora Settings? Evaluating Annotation Divergence in the UD Project.
Guillaume Wisniewski | François Yvon
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The performance of Part-of-Speech tagging varies significantly across the treebanks of the Universal Dependencies project. This work points out that these variations may result from divergences between the annotation of train and test sets. We show how the annotation variation principle, introduced by Dickinson and Meurers (2003) to automatically detect errors in gold standard, can be used to identify inconsistencies between annotations; we also evaluate their impact on prediction performance.

pdf bib
Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content
José Carlos Rosales Núñez | Djamé Seddah | Guillaume Wisniewski
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This work compares the performances achieved by Phrase-Based Statistical Machine Translation systems (PB-SMT) and attention-based Neuronal Machine Translation systems (NMT) when translating User Generated Content (UGC), as encountered in social medias, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenue for improving NMT models.

2018

pdf bib
Analyse morpho-syntaxique en présence d’alternance codique (PoS tagging of Code Switching)
José Carlos Rosales Núñez | Guillaume Wisniewski
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

L’alternance codique est le phénomène qui consiste à alterner les langues au cours d’une même conversation ou d’une même phrase. Avec l’augmentation du volume généré par les utilisateurs, ce phénomène essentiellement oral, se retrouve de plus en plus dans les textes écrits, nécessitant d’adapter les tâches et modèles de traitement automatique de la langue à ce nouveau type d’énoncés. Ce travail présente la collecte et l’annotation en partie du discours d’un corpus d’énoncés comportant des alternances codiques et évalue leur impact sur la tâche d’analyse morpho-syntaxique.

pdf bib
Divergences entre annotations dans le projet Universal Dependencies et leur impact sur l’évaluation des performance d’étiquetage morpho-syntaxique (Evaluating Annotation Divergences in the UD Project)
Guillaume Wisniewski | François Yvon
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Ce travail montre que la dégradation des performances souvent observée lors de l’application d’un analyseur morpho-syntaxique à des données hors domaine résulte souvent d’incohérences entre les annotations des ensembles de test et d’apprentissage. Nous montrons comment le principe de variation des annotations, introduit par Dickinson & Meurers (2003) pour identifier automatiquement les erreurs d’annotation, peut être utilisé pour identifier ces incohérences et évaluer leur impact sur les performances des analyseurs morpho-syntaxiques.

pdf bib
Les méthodes « apprendre à chercher » en traitement automatique des langues : un état de l’art [A survey of learning-to-search techniques in Natural Language Processing]
Elena Knyazeva | Guillaume Wisniewski | François Yvon
Traitement Automatique des Langues, Volume 59, Numéro 1 : Varia [Varia]

pdf bib
Quantifying training challenges of dependency parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 27th International Conference on Computational Linguistics

Not all dependencies are equal when training a dependency parser: some are straightforward enough to be learned with only a sample of data, others embed more complexity. This work introduces a series of metrics to quantify those differences, and thereby to expose the shortcomings of various parsing algorithms and strategies. Apart from a more thorough comparison of parsing systems, these new tools also prove useful for characterizing the information conveyed by cross-lingual parsers, in a quantitative but still interpretable way.

pdf bib
Errator: a Tool to Help Detect Annotation Errors in the Universal Dependencies Project
Guillaume Wisniewski
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Automatically Selecting the Best Dependency Annotation Design with Dynamic Oracles
Guillaume Wisniewski | Ophélie Lacroix | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This work introduces a new strategy to compare the numerous conventions that have been proposed over the years for expressing dependency structures and discover the one for which a parser will achieve the highest parsing performance. Instead of associating each sentence in the training set with a single gold reference we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select among all alternatives the reference that will be predicted with the highest accuracy. Experiments on the UD corpora show the validity of this approach.

pdf bib
Exploiting Dynamic Oracles to Train Projective Dependency Parsers on Non-Projective Trees
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Because the most common transition systems are projective, training a transition-based dependency parser often implies to either ignore or rewrite the non-projective training examples, which has an adverse impact on accuracy. In this work, we propose a simple modification of dynamic oracles, which enables the use of non-projective data when training projective parsers. Evaluation on 73 treebanks shows that our method achieves significant gains (+2 to +7 UAS for the most non-projective languages) and consistently outperforms traditional projectivization and pseudo-projectivization approaches.

pdf bib
Automated Paraphrase Lattice Creation for HyTER Machine Translation Evaluation
Marianna Apidianaki | Guillaume Wisniewski | Anne Cocos | Chris Callison-Burch
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a variant of a well-known machine translation (MT) evaluation metric, HyTER (Dreyer and Marcu, 2012), which exploits reference translations enriched with meaning equivalent expressions. The original HyTER metric relied on hand-crafted paraphrase networks which restricted its applicability to new data. We test, for the first time, HyTER with automatically built paraphrase lattices. We show that although the metric obtains good results on small and carefully curated data with both manually and automatically selected substitutes, it achieves medium performance on much larger and noisier datasets, demonstrating the limits of the metric for tuning and evaluation of current MT systems.

2017

pdf bib
Adaptation au domaine pour l’analyse morpho-syntaxique (Domain Adaptation for PoS tagging)
Éléonor Bartenlian | Margot Lacour | Matthieu Labeau | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Ce travail cherche à comprendre pourquoi les performances d’un analyseur morpho-syntaxiques chutent fortement lorsque celui-ci est utilisé sur des données hors domaine. Nous montrons à l’aide d’une expérience jouet que ce comportement peut être dû à un phénomène de masquage des caractéristiques lexicalisées par les caractéristiques non lexicalisées. Nous proposons plusieurs modèles essayant de réduire cet effet.

pdf bib
Don’t Stop Me Now! Using Global Dynamic Oracles to Correct Training Biases of Transition-Based Dependency Parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper formalizes a sound extension of dynamic oracles to global training, in the frame of transition-based dependency parsers. By dispensing with the pre-computation of references, this extension widens the training strategies that can be entertained for such parsers; we show this by revisiting two standard training procedures, early-update and max-violation, to correct some of their search space sampling biases. Experimentally, on the SPMRL treebanks, this improvement increases the similarity between the train and test distributions and yields performance improvements up to 0.7 UAS, without any computation overhead.

pdf bib
LIMSI@CoNLL’17: UD Shared Task
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes LIMSI’s submission to the CoNLL 2017 UD Shared Task, which is focused on small treebanks, and how to improve low-resourced parsing only by ad hoc combination of multiple views and resources. We present our approach for low-resourced parsing, together with a detailed analysis of the results for each test treebank. We also report extensive analysis experiments on model selection for the PUD treebanks, and on annotation consistency among UD treebanks.

pdf bib
A Systematic Comparison of Syntactic Representations of Dependency Parsing
Guillaume Wisniewski | Ophélie Lacroix
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

pdf bib
LIMSI Submission for WMT’17 Shared Task on Bandit Learning
Guillaume Wisniewski
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
Apprentissage d’analyseur en dépendances cross-lingue par projection partielle de dépendances (Cross-lingual learning of dependency parsers from partially projected dependencies )
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Cet article présente une méthode simple de transfert cross-lingue de dépendances. Nous montrons tout d’abord qu’il est possible d’apprendre un analyseur en dépendances par transition à partir de données partiellement annotées. Nous proposons ensuite de construire de grands ensembles de données partiellement annotés pour plusieurs langues cibles en projetant les dépendances via les liens d’alignement les plus sûrs. En apprenant des analyseurs pour les langues cibles à partir de ces données partielles, nous montrons que cette méthode simple obtient des performances qui rivalisent avec celles de méthodes état-de-l’art récentes, tout en ayant un coût algorithmique moindre.

pdf bib
Ne nous arrêtons pas en si bon chemin : améliorations de l’apprentissage global d’analyseurs en dépendances par transition (Don’t Stop Me Now ! Improved Update Strategies for Global Training of Transition-Based)
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Dans cet article, nous proposons trois améliorations simples pour l’apprentissage global d’analyseurs en dépendances par transition de type A RC E AGER : un oracle non déterministe, la reprise sur le même exemple après une mise à jour et l’entraînement en configurations sous-optimales. Leur combinaison apporte un gain moyen de 0,2 UAS sur le corpus SPMRL. Nous introduisons également un cadre général permettant la comparaison systématique de ces stratégies et de la plupart des variantes connues. Nous montrons que la littérature n’a étudié que quelques stratégies parmi les nombreuses variations possibles, négligeant ainsi plusieurs pistes d’améliorations potentielles.

pdf bib
Investigating gender adaptation for speech translation
Rachel Bawden | Guillaume Wisniewski | Hélène Maynard
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In this paper we investigate the impact of the integration of context into dialogue translation. We present a new contextual parallel corpus of television subtitles and show how taking into account speaker gender can significantly improve machine translation quality in terms of B LEU and M ETEOR scores. We perform a manual analysis, which suggests that these improvements are not necessary related to the morphological consequences of speaker gender, but to more general linguistic divergences.

pdf bib
Zero-resource Dependency Parsing: Boosting Delexicalized Cross-lingual Transfer with Linguistic Knowledge
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper studies cross-lingual transfer for dependency parsing, focusing on very low-resource settings where delexicalized transfer is the only fully automatic option. We show how to boost parsing performance by rewriting the source sentences so as to better match the linguistic regularities of the target language. We contrast a data-driven approach with an approach relying on linguistically motivated rules automatically extracted from the World Atlas of Language Structures. Our findings are backed up by experiments involving 40 languages. They show that both approaches greatly outperform the baseline, the knowledge-driven method yielding the best accuracies, with average improvements of +2.9 UAS, and up to +90 UAS (absolute) on some frequent PoS configurations.

pdf bib
Cross-lingual and Supervised Models for Morphosyntactic Annotation: a Comparison on Romanian
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Because of the small size of Romanian corpora, the performance of a PoS tagger or a dependency parser trained with the standard supervised methods fall far short from the performance achieved in most languages. That is why, we apply state-of-the-art methods for cross-lingual transfer on Romanian tagging and parsing, from English and several Romance languages. We compare the performance with monolingual systems trained with sets of different sizes and establish that training on a few sentences in target language yields better results than transferring from large datasets in other languages.

pdf bib
Frustratingly Easy Cross-Lingual Transfer for Transition-Based Dependency Parsing
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Cross-lingual Dependency Transfer : What Matters? Assessing the Impact of Pre- and Post-processing
Ophélie Lacroix | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
Cross-lingual alignment transfer: a chicken-and-egg story?
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
LIMSI@WMT’16: Machine Translation of News
Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Ophélie Lacroix | Elena Knyazeva | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

pdf bib
Apprentissage par imitation pour l’étiquetage de séquences : vers une formalisation des méthodes d’étiquetage easy-first
Elena Knyazeva | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

De nombreuses méthodes ont été proposées pour accélérer la prédiction d’objets structurés (tels que les arbres ou les séquences), ou pour permettre la prise en compte de dépendances plus riches afin d’améliorer les performances de la prédiction. Ces méthodes reposent généralement sur des techniques d’inférence approchée et ne bénéficient d’aucune garantie théorique aussi bien du point de vue de la qualité de la solution trouvée que du point de vue de leur critère d’apprentissage. Dans ce travail, nous étudions une nouvelle formulation de l’apprentissage structuré qui consiste à voir celui-ci comme un processus incrémental au cours duquel la sortie est construite de façon progressive. Ce cadre permet de formaliser plusieurs approches de prédiction structurée existantes. Grâce au lien que nous faisons entre apprentissage structuré et apprentissage par renforcement, nous sommes en mesure de proposer une méthode théoriquement bien justifiée pour apprendre des méthodes d’inférence approchée. Les expériences que nous réalisons sur quatre tâches de TAL valident l’approche proposée.

pdf bib
Oublier ce qu’on sait, pour mieux apprendre ce qu’on ne sait pas : une étude sur les contraintes de type dans les modèles CRF
Nicolas Pécheux | Alexandre Allauzen | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Quand on dispose de connaissances a priori sur les sorties possibles d’un problème d’étiquetage, il semble souhaitable d’inclure cette information lors de l’apprentissage pour simplifier la tâche de modélisation et accélérer les traitements. Pourtant, même lorsque ces contraintes sont correctes et utiles au décodage, leur utilisation lors de l’apprentissage peut dégrader sévèrement les performances. Dans cet article, nous étudions ce paradoxe et montrons que le manque de contraste induit par les connaissances entraîne une forme de sous-apprentissage qu’il est cependant possible de limiter.

pdf bib
Why Predicting Post-Edition is so Hard? Failure Analysis of LIMSI Submission to the APE Shared Task
Guillaume Wisniewski | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

pdf bib
Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning
Guillaume Wisniewski | Nicolas Pécheux | Souhir Gahbiche-Braham | François Yvon
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Cross-Lingual POS Tagging through Ambiguous Learning: First Experiments (Apprentissage partiellement supervisé d’un étiqueteur morpho-syntaxique par transfert cross-lingue) [in French]
Guillaume Wisniewski | Nicolas Pécheux | Elena Knyazeva | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
A Corpus of Machine Translation Errors Extracted from Translation Students Exercises
Guillaume Wisniewski | Natalie Kübler | François Yvon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present a freely available corpus of automatic translations accompanied with post-edited versions, annotated with labels identifying the different kinds of errors made by the MT system. These data have been extracted from translation students exercises that have been corrected by a senior professor. This corpus can be useful for training quality estimation tools and for analyzing the types of errors made MT system.

pdf bib
LIMSI Submission for WMT’14 QE Task
Guillaume Wisniewski | Nicolas Pécheux | Alexander Allauzen | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

pdf bib
Design and Analysis of a Large Corpus of Post-Edited Translations: Quality Estimation, Failure Analysis and the Variability of Post-Edition
Guillaume Wisniewski | Anil Kumar Singh | Natalia Segal | François Yvon
Proceedings of Machine Translation Summit XIV: Papers

pdf bib
A corpus of post-edited translations (Un corpus d’erreurs de traduction) [in French]
Guillaume Wisniewski | Anil Kumar Singh | Natalia Segal | François Yvon
Proceedings of TALN 2013 (Volume 2: Short Papers)

pdf bib
On the Predictability of Human Assessment: when Matrix Completion Meets NLP Evaluation
Guillaume Wisniewski
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
LIMSI Submission for the WMT‘13 Quality Estimation Task: an Experiment with N-Gram Posteriors
Anil Kumar Singh | Guillaume Wisniewski | François Yvon
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

pdf bib
Non-linear n-best List Reranking with Few Features
Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In Machine Translation, it is customary to compute the model score of a predicted hypothesis as a linear combination of multiple features, where each feature assesses a particular facet of the hypothesis. The choice of a linear combination is usually justified by the possibility of efficient inference (decoding); yet, the appropriateness of this simple combination scheme to the task at hand is rarely questioned. In this paper, we propose an approach that replaces the linear scoring function with a non-linear scoring function. To investigate the applicability of this approach, we rescore n-best lists generated with a conventional machine translation engine (using a linear scoring function for generating its hypotheses) with a non-linear scoring function learned using the learning-to-rank framework. Moderate, though consistent, gains in BLEU are demonstrated on the WMT’10, WMT’11 and WMT’12 test sets.

pdf bib
Computing Lattice BLEU Oracle Scores for Machine Translation
Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Non-Linear Models for Confidence Estimation
Yong Zhuang | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
LIMSI @ WMT12
Hai-Son Le | Thomas Lavergne | Alexandre Allauzen | Marianna Apidianaki | Li Gong | Aurélien Max | Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
WSD for n-best reranking and local language modeling in SMT
Marianna Apidianaki | Guillaume Wisniewski | Artem Sokolov | Aurélien Max | François Yvon
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

2011

pdf bib
LIMSI @ WMT11
Alexandre Allauzen | Hélène Bonneau-Maynard | Hai-Son Le | Aurélien Max | Guillaume Wisniewski | François Yvon | Gilles Adda | Josep Maria Crego | Adrien Lardilleux | Thomas Lavergne | Artem Sokolov
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

pdf bib
Refining Word Alignment with Discriminative Training
Nadi Tomeh | Alexandre Allauzen | François Yvon | Guillaume Wisniewski
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

The quality of statistical machine translation systems depends on the quality of the word alignments that are computed during the translation model training phase. IBM alignment models, as implemented in the GIZA++ toolkit, constitute the de facto standard for performing these computations. The resulting alignments and translation models are however very noisy, and several authors have tried to improve them. In this work, we propose a simple and effective approach, which considers alignment as a series of independent binary classification problems in the alignment matrix. Through extensive feature engineering and the use of stacking techniques, we were able to obtain alignments much closer to manually defined references than those obtained by the IBM models. These alignments also yield better translation models, delivering improved performance in a large scale Arabic to English translation task.

pdf bib
LIMSI @ IWSLT 2010
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | Le Hai-Son | Guillaume Wisniewski | François Yvon
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes LIMSI’s Statistical Machine Translation systems (SMT) for the IWSLT evaluation, where we participated in two tasks (Talk for English to French and BTEC for Turkish to English). For the Talk task, we studied an extension of our in-house n-code SMT system (the integration of a bilingual reordering model over generalized translation units), as well as the use of training data extracted from Wikipedia in order to adapt the target language model. For the BTEC task, we concentrated on pre-processing schemes on the Turkish side in order to reduce the morphological discrepancies with the English side. We also evaluated the use of two different continuous space language models for such a small size of training data.

pdf bib
Recueil et analyse d’un corpus écologique de corrections orthographiques extrait des révisions de Wikipédia
Guillaume Wisniewski | Aurélien Max | François Yvon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Dans cet article, nous introduisons une méthode à base de règles permettant d’extraire automatiquement de l’historique des éditions de l’encyclopédie collaborative Wikipédia des corrections orthographiques. Cette méthode nous a permis de construire un corpus d’erreurs composé de 72 483 erreurs lexicales (non-word errors) et 74 100 erreurs grammaticales (real-word errors). Il n’existe pas, à notre connaissance, de plus gros corpus d’erreurs écologiques librement disponible. En outre, les techniques mises en oeuvre peuvent être facilement transposées à de nombreuses autres langues. La collecte de ce corpus ouvre de nouvelles perspectives pour l’étude des erreurs fréquentes ainsi que l’apprentissage et l’évaluation des correcteurs orthographiques automatiques. Plusieurs expériences illustrant son intérêt sont proposées.

pdf bib
Training Continuous Space Language Models: Some Practical Issues
Hai Son Le | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Assessing Phrase-Based Translation Models with Oracle Decoding
Guillaume Wisniewski | Alexandre Allauzen | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Mining Naturally-occurring Corrections and Paraphrases from Wikipedia’s Revision History
Aurélien Max | Guillaume Wisniewski
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Naturally-occurring instances of linguistic phenomena are important both for training and for evaluating automatic text processing. When available in large quantities, they also prove interesting material for linguistic studies. In this article, we present WiCoPaCo (Wikipedia Correction and Paraphrase Corpus), a new freely-available resource built by automatically mining Wikipedia’s revision history. The WiCoPaCo corpus focuses on local modifications made by human revisors and include various types of corrections (such as spelling error or typographical corrections) and rewritings, which can be categorized broadly into meaning-preserving and meaning-altering revisions. We present an initial hand-built typology of these revisions, but the resource allows for any possible annotation scheme. We discuss the main motivations for building such a resource and describe the main technical details guiding its construction. We also present applications and data analysis on French and report initial results on spelling error correction and morphosyntactic rewriting. The WiCoPaCo corpus can be freely downloaded from http://wicopaco.limsi.fr.

2009

pdf bib
Modèles discriminants pour l’alignement mot à mot [Discriminant Models for Word Alignment]
Alexandre Allauzen | Guillaume Wisniewski
Traitement Automatique des Langues, Volume 50, Numéro 3 : Apprentissage automatique pour le TAL [Machine Learning for NLP]