François Yvon - ACL Anthology

François Yvon

Also published as: Francois Yvon

2026

AdaptBPE: From General Purpose to Specialized Tokenizers
Vijini Pilana Liyanage | François Yvon
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.
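As a rough illustration of the frequency criterion mentioned in this abstract, the sketch below ranks vocabulary entries by how often they occur in an already tokenized adaptation corpus. It is not the AdaptBPE algorithm: the function name `low_utility_tokens`, the input format, and the tie-breaking rule are assumptions made for this example, and the step that selects replacement tokens is left out.

```python
from collections import Counter

def low_utility_tokens(vocab, tokenized_corpus, n):
    """Return the n vocabulary tokens that are rarest in the adaptation corpus.

    Hypothetical helper: `tokenized_corpus` is a list of token lists produced
    by the existing tokenizer; ties are broken alphabetically for determinism.
    """
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    # Counter returns 0 for tokens that never occur, so unused tokens rank first.
    ranked = sorted(vocab, key=lambda tok: (counts[tok], tok))
    return ranked[:n]

vocab = ["a", "b", "c", "d"]
corpus = [["a", "b", "a"], ["a", "c"]]
candidates = low_utility_tokens(vocab, corpus, 2)
```

Tokens returned here would only be candidates for replacement; how replacements are chosen for a given target vocabulary size is described in the paper itself.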

Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions
Léo Labat | Etienne Ollion | François Yvon
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.

2025

How Programming Concepts and Neurons Are Shared in Code Language Models
Amir Hossein Kargaran | Yihong Liu | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: ACL 2025

Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model’s concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model’s concept space. Code is available at https://github.com/cisnlp/code-specific-neurons.

How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study
Matthieu Dubois | François Yvon | Pablo Piantanida
Findings of the Association for Computational Linguistics: EMNLP 2025

As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model’s (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework https://github.com/BaggerOfWords/Sampling-and-Detection.
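For readers unfamiliar with the decoding parameters named in this abstract, here is a minimal sketch of how temperature scaling and top-p filtering reshape a next-token distribution. The function names and the toy logits are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by the temperature and return a softmax distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    mass reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

base = apply_temperature([2.0, 1.0, 0.0], 1.0)
sharp = apply_temperature([2.0, 1.0, 0.0], 0.5)   # lower temperature, peakier
nucleus = top_p_filter([0.5, 0.3, 0.2], 0.7)       # drops the least likely token
```

Both operations change the token-level distribution that a detector implicitly expects, which is the kind of shift the study examines.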

Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu
Renhao Pei | Yihong Liu | Peiqin Lin | François Yvon | Hinrich Schuetze
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an enciphered version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap a conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.

MOSAIC at GENAI Detection Task 3 : Zero-Shot Detection Using an Ensemble of Models
Matthieu Dubois | François Yvon | Pablo Piantanida
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)

MOSAIC introduces a new ensemble approach that combines several detector models to spot AI-generated texts. The method enhances the reliability of detection by integrating insights from multiple models, thus addressing the limitations of using a single detector model which often results in performance brittleness. This approach also involves using a theoretically grounded algorithm to minimize the worst-case expected encoding size across models, thereby optimizing the detection process. In this submission, we report evaluation results on the RAID benchmark, a comprehensive English-centric testbed for machine-generated texts. These results were obtained in the context of the “Cross-domain Machine-Generated Text Detection” shared task. We show that our model can be competitive for a variety of domains and generator models, but that it can be challenged by adversarial attacks and by changes in the text generation strategy.

On Relation-Specific Neurons in Large Language Models
Yihong Liu | Runsheng Chen | Lea Hirlimann | Ahmad Dawar Hakimi | Mingyang Wang | Amir Hossein Kargaran | Sascha Rothe | François Yvon | Hinrich Schuetze
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While factual knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself – independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relation. To investigate this, we study the Llama-2 family on a chosen set of relations, with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation r on the LLM’s ability to handle (1) facts involving relation r and (2) facts involving a different relation r' ≠ r. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. (i) Neuron cumulativity. Multiple neurons jointly contribute to processing facts involving relation r, with no single neuron fully encoding a fact in r on its own. (ii) Neuron versatility. Neurons can be shared across multiple closely related as well as less related relations. In addition, some relation neurons transfer across languages. (iii) Neuron interference. Deactivating neurons specific to one relation can improve LLMs’ factual recall performance for facts of other relations. We make our code and data publicly available at https://github.com/cisnlp/relation-specific-neurons.

Prompting LLMs: Length Control for Isometric Machine Translation
Dávid Javorský | Ondřej Bojar | François Yvon
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En→De, En→Fr, and En→Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs notably improves the overall tradeoff between length and quality, yielding state-of-the-art performance for some language pairs.

Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and real-world usage, particularly for non-expert users who may struggle to assess translation reliability. This paper advocates for a human-centered approach to MT, emphasizing the alignment of system design with diverse communicative goals and contexts of use. We survey the literature in Translation Studies and Human-Computer Interaction to recontextualize MT evaluation and design to address the diverse real-world scenarios in which MT is used today.

Unlike “Likely”, “Unlike” is Unlikely: BPE-based Segmentation hurts Morphological Derivations in LLMs
Paul Lerner | François Yvon
Proceedings of the 31st International Conference on Computational Linguistics

Large Language Models (LLMs) rely on subword vocabularies to process and generate text. However, because subwords are marked as initial- or intra-word, we find that LLMs perform poorly at handling some types of affixations, which hinders their ability to generate novel (unobserved) word forms. The largest models trained on enough data can mitigate this tendency because their initial- and intra-word embeddings are aligned; in-context learning also helps when all examples are selected in a consistent way; but only morphological segmentation can achieve a near-perfect accuracy.

This paper is a short presentation of MaTOS, a project focusing on the automatic translation of scholarly documents. Its main aims are threefold: (a) to develop resources (term lists and corpora) for high-quality machine translation; (b) to study methods for translating complete, structured documents in a cohesive and consistent manner; (c) to propose novel metrics to evaluate machine translation in technical domains. Publications and resources are available on the project web site: https://anr-matos.github.io.

MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines
Dávid Javorský | Ondřej Bojar | François Yvon
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In simultaneous interpreting, an interpreter renders the speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need specialized datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g. shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we develop and explore MockConf, a student interpretation dataset that was collected from Mock Conferences run as part of the students’ curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools will be released to the community.

Améliorer la Traduction Neuronale par Exemple avec des Données Monolingues
Maxime Bouthors | Josep Crego | François Yvon
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

Retrieval-augmented neural machine translation (RANMT) systems rely on bilingual corpora known as translation memories (TM). Yet in many cases, monolingual in-domain corpora in the target language are available. Our work explores how to exploit such resources by retrieving relevant segments directly in the target language, conditioned on a source sentence used as the query. To this end, we propose to improve cross-lingual retrieval systems by training them to perform lexical associations. Our experiments with two neural architectures show the benefit of our method in a controlled setting, leading to translation performance that can surpass that of methods based on a translation memory. Finally, we evaluate our method in a realistic configuration in which the amount of monolingual data exceeds that of parallel data. This approach yields a clear improvement in performance over baseline models as well as pretrained encoders.

Investigating Length Issues in Document-level Machine Translation
Ziqian Peng | Rachel Bawden | François Yvon
Proceedings of Machine Translation Summit XX: Volume 1

Transformer architectures are increasingly effective at processing and generating very long chunks of texts, opening new perspectives for document-level machine translation (MT). In this work, we challenge the ability of MT systems to handle texts comprising up to several thousands of tokens. We design and implement a new approach designed to precisely measure the effect of length increments on MT outputs. Our experiments with two representative architectures unambiguously show that (a) translation performance decreases with the length of the input text; (b) the position of sentences within the document matters and translation quality is higher for sentences occurring earlier in a document. We further show that manipulating the distribution of document lengths and of positional embeddings only marginally mitigates such problems. Our results suggest that even though document-level MT is computationally feasible, it does not yet match the performance of sentence-based MT.

Comment mesurer les biais politiques des grands modèles de langue multilingues?
Paul Lerner | Laurène Cave | Hal Daumé | Léo Labat | Gaël Lejeune | Pierre-Antoine Lequeu | Benjamin Piwowarski | Nazanin Shafiabadi | François Yvon
Actes de l'atelier Ethic and Alignment of (Large) Language Models 2025 (EALM)

We propose a new method to measure the political biases of multilingual large language models in machine translation, writing assistance, and automatic summarization. We rely on a dense representation of the political opinions expressed in texts, learned in a weakly supervised manner.

MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
Amir Hossein Kargaran | Ali Modarressi | Nafiseh Nikeghbal | Jana Diesner | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: ACL 2025

English-centric large language models (LLMs) often show strong multilingual capabilities. However, their multilingual performance remains unclear and is under-evaluated for many other languages. Most benchmarks for multilinguality focus on classic NLP tasks or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages that English-centric LLMs use English as a pivot language in their intermediate layers. MEXA computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in different languages. We conduct controlled experiments using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves an average Pearson correlation of 0.90 between its predicted scores and actual task performance across languages. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://cis-lmu-mexa.hf.space, Code: https://github.com/cisnlp/MEXA.
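The abstract does not spell out how the alignment between English and non-English sentences is scored. One plausible reading, sketched below as an assumption rather than as the MEXA definition, counts a parallel pair as aligned when its cosine similarity beats every competing pair in the same row and column of the similarity matrix. The function names and toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_score(english_embs, other_embs):
    """Fraction of parallel pairs (i, i) whose similarity is strictly higher
    than that of every non-parallel pair sharing sentence i (assumed criterion)."""
    n = len(english_embs)
    sim = [[cosine(e, o) for o in other_embs] for e in english_embs]
    hits = 0
    for i in range(n):
        rivals = [sim[i][j] for j in range(n) if j != i]
        rivals += [sim[j][i] for j in range(n) if j != i]
        if all(sim[i][i] > r for r in rivals):
            hits += 1
    return hits / n

english = [[1.0, 0.0], [0.0, 1.0]]
well_aligned = [[0.9, 0.1], [0.1, 0.9]]
misaligned = [[0.1, 0.9], [0.9, 0.1]]
good = alignment_score(english, well_aligned)
bad = alignment_score(english, misaligned)
```

In practice the embeddings would come from an intermediate layer of the model under evaluation, and the resulting score would then be compared with downstream task performance.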

Self-Retrieval from Distant Contexts for Document-Level Machine Translation
Ziqian Peng | Rachel Bawden | François Yvon
Proceedings of the Tenth Conference on Machine Translation

Document-level machine translation is a challenging task, as it requires modeling both short-range and long-range dependencies to maintain the coherence and cohesion of the generated translation. However, these dependencies are sparse, and most context-augmented translation systems resort to two equally unsatisfactory options: either to include maximally long contexts, hoping that the useful dependencies are not lost in the noise; or to use limited local contexts, at the risk of missing relevant information. In this work, we study a self-retrieval-augmented machine translation framework (Self-RAMT), aimed at informing translation decisions with informative local and global contexts dynamically extracted from the source and target texts. We examine the effectiveness of this method using three large language models, considering three criteria for context selection. We carry out experiments on TED talks as well as parallel scientific articles, considering three translation directions. Our results show that integrating distant contexts with Self-RAMT improves translation quality as measured by reference-based scores and consistency metrics.

Towards the Machine Translation of Scientific Neologisms
Paul Lerner | François Yvon
Proceedings of the 31st International Conference on Computational Linguistics

Scientific research continually discovers and invents new concepts, which are then referred to by new terms, neologisms, or neonyms in this context. As the vast majority of publications are written in English, disseminating this new knowledge to the general public often requires translating these terms. However, by definition, no parallel data exist to provide such translations. Therefore, we propose to leverage term definitions as a useful source of information for the translation process. As we discuss, Large Language Models are well suited for this task and can benefit from in-context learning with co-hyponyms and terms sharing the same derivation paradigm. These models, however, are sensitive to the superficial and morphological similarity between source and target terms. Their predictions are also impacted by subword tokenization, especially for prefixed terms.

MOSAIC: Multiple Observers Spotting AI Content
Matthieu Dubois | François Yvon | Pablo Piantanida
Findings of the Association for Computational Linguistics: ACL 2025

The dissemination of Large Language Models (LLMs), trained at scale, and endowed with powerful text-generating abilities, has made it easier for all to produce harmful, toxic, faked or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a binary classification problem. Early approaches evaluate an input document with a well-chosen detector LLM, assuming that low-perplexity scores reliably signal machine-made content. More recent systems instead consider two LLMs and compare their probability distributions over the document to further discriminate when perplexity alone cannot. However, using a fixed pair of models can induce brittleness in performance. We extend these approaches to the ensembling of several LLMs and derive a new, theoretically grounded approach to combine their respective strengths. Our experiments, using a variety of generator LLMs, suggest that this approach effectively harnesses each model’s capabilities, leading to strong detection performance on a variety of domains.
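The perplexity-based baseline that this abstract starts from can be sketched in a few lines. The example below only illustrates that early single-model approach, not the ensemble method proposed in the paper; the threshold value and function names are hypothetical, and the token log-probabilities would normally come from a detector LLM.

```python
import math

def perplexity(token_logprobs):
    """Perplexity is the exponential of the average negative log-probability
    (natural log) that the detector model assigns to each token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def looks_machine_written(token_logprobs, threshold):
    """Flag a document when its perplexity falls below a chosen threshold
    (the threshold is a free parameter in this sketch)."""
    return perplexity(token_logprobs) < threshold

predictable = [math.log(0.9)] * 5   # tokens the detector finds very likely
surprising = [math.log(0.1)] * 5    # tokens the detector finds unlikely
flag_predictable = looks_machine_written(predictable, 2.0)
flag_surprising = looks_machine_written(surprising, 2.0)
```

A single fixed detector and threshold is exactly the setting the abstract describes as brittle, which motivates combining several models.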

Alignements divisifs de textes parallèles: données, algorithme et évaluation
Joanna Radoła | François Yvon
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

We present Alibi, a corpus of French-English hierarchical sub-sentential alignments, manually annotated using a divisive strategy. We compare the resulting alignments, at a global level, with several word-aligned parallel corpora and calibrate the corpus's difficulty by producing automatic alignments with state-of-the-art methods. We also propose an algorithm that uses neural representations of words and word groups to reproduce the reference hierarchical alignments. Finally, we propose an evaluation metric for alignment trees, with which we compare the performance of several variants of the alignment algorithm, obtained by varying the measures used to match word groups. Our results show that (a) the reference alignment trees are highly ambiguous and hard to reproduce automatically, whereas word-to-word alignments are predicted reliably; (b) using alternatives to cosine similarity to score block matching significantly improves the results of the baseline system.

Tracing Multilingual Factual Knowledge Acquisition in Pretraining
Yihong Liu | Mingyang Wang | Amir Hossein Kargaran | Felicia Körner | Ercong Nie | Barbara Plank | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts – an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at https://github.com/cisnlp/multilingual-fact-tracing.

MOSAIC : Mélange d’experts pour la détection de textes artificiels
Matthieu Dubois | Pablo Piantanida | François Yvon
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : articles scientifiques originaux

The spread of large language models among the general public makes it easier to produce harmful, slanderous, dishonest or forged content. In response, several solutions have been proposed to identify texts produced in this way, treating the problem as a binary classification task. Early approaches analyze a document with a detector model, under the assumption that a low perplexity score indicates artificial content. More recent methods propose comparing the probability distributions computed by two models. However, relying on a fixed pair of models can make performance brittle. We extend these methods by combining several models and by developing a theoretically grounded approach to make the best use of each of them.

How Transliterations Improve Crosslingual Alignment
Yihong Liu | Mingyang Wang | Amir Hossein Kargaran | Ayyoob ImaniGooghari | Orgest Xhelili | Haotian Ye | Chunlan Ma | François Yvon | Hinrich Schütze
Proceedings of the 31st International Conference on Computational Linguistics

Recent studies have shown that post-aligning multilingual pretrained language models (mPLMs) using alignment objectives on both original and transliterated data can improve crosslingual alignment. This improvement further leads to better crosslingual transfer performance. However, it remains unclear how and why a better crosslingual alignment is achieved, as this technique only involves transliterations, and does not use any parallel data. This paper attempts to explicitly evaluate the crosslingual alignment and identify the key elements in transliteration-based approaches that contribute to better performance. For this, we train multiple models under varying setups for two pairs of related languages: (1) Polish and Ukrainian and (2) Hindi and Urdu. To assess alignment, we define four types of similarities based on sentence representations. Our experimental results show that adding transliterations alone improves the overall similarities, even for random sentence pairs. With the help of auxiliary transliteration-based alignment objectives, especially the contrastive objective, the model learns to distinguish matched from random pairs, leading to better crosslingual alignment. However, we also show that better alignment does not always yield better downstream performance, suggesting that further research is needed to clarify the connection between alignment and performance. The code implementation is based on https://github.com/cisnlp/Transliteration-PPA.

2024

Retrieving Examples from Memory for Retrieval Augmented Neural Machine Translation: A Systematic Comparison
Maxime Bouthors | Josep Crego | François Yvon
Findings of the Association for Computational Linguistics: NAACL 2024

Retrieval-Augmented Neural Machine Translation (RAMT) architectures retrieve examples from memory to guide the generation process. While most works in this trend explore new ways to exploit the retrieved examples, the upstream retrieval step is mostly unexplored. In this paper, we study the effect of varying retrieval methods for several translation architectures to better understand the interplay between these two processes. We conduct experiments in two language pairs in a multi-domain setting and consider several downstream architectures based on a standard autoregressive model, an edit-based model, and a large language model with in-context learning. Our experiments show that the choice of the retrieval technique impacts the translation scores, with variance across architectures. We also discuss the effects of increasing the number and diversity of examples, which are mostly positive across the board.

GlotScript: A Resource and Tool for Low Resource Writing System Identification
Amir Hossein Kargaran | François Yvon | Hinrich Schütze
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help cleaning multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.

Optimiser le choix des exemples pour la traduction automatique augmentée par des mémoires de traduction (Optimizing example selection for translation-memory-augmented machine translation)
Maxime Bouthors | Josep Crego | François Yvon
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Example-based neural translation relies on a translation memory containing examples similar to the sentences to be translated. These examples are used to condition the predictions of a neural decoder. We focus on improving the system that performs the retrieval of similar sentences, with the neural decoder architecture held fixed and based here on an explicit edit model, the multi-Levenshtein Transformer. The problem considered is to find an optimal set of similar examples, i.e. one that maximally covers the source sentence. Drawing on the theory of submodular functions, we explore new algorithms to optimize this coverage and evaluate the performance improvements they bring to the machine translation task.
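
The coverage objective in the abstract above can be illustrated with the classic greedy algorithm for maximum coverage, a standard way to optimize a monotone submodular function. This is a sketch under assumptions, not the algorithms proposed in the paper: coverage is measured here as plain token overlap, and the function name `greedy_cover` and its parameters are hypothetical.

```python
def greedy_cover(source_tokens, candidates, k):
    """Greedily pick up to k candidate examples that cover the most source tokens.

    source_tokens: list of tokens of the sentence to translate.
    candidates: list of token lists, one per retrieved example (source side).
    Returns the indices of the selected candidates, in selection order.
    """
    uncovered = set(source_tokens)
    remaining = list(range(len(candidates)))
    selected = []
    for _ in range(k):
        best, best_gain = None, 0
        for i in remaining:
            # Marginal gain: source tokens this example covers that are still uncovered.
            gain = len(uncovered & set(candidates[i]))
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no candidate adds any coverage
            break
        selected.append(best)
        remaining.remove(best)
        uncovered -= set(candidates[best])
    return selected


source = "the cat sat on the mat".split()
memory = [
    ["the", "cat", "slept"],
    ["a", "dog", "sat", "on", "the", "mat"],
    ["the", "cat", "sat"],
]
print(greedy_cover(source, memory, k=2))
```

Because marginal gains are recomputed after every pick, the second example chosen is the one that adds the most new coverage rather than the one most similar to the source overall.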

À propos des difficultés de traduire automatiquement de longs documents (On the difficulties of automatically translating long documents)
Ziqian Peng | Rachel Bawden | François Yvon
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

New machine translation architectures are able to process long segments and to outperform the translation of isolated sentences, raising the prospect of translating complete documents. To achieve this, a number of difficulties related to the length of the documents to be translated must be overcome. In this study, we discuss document translation from the perspective of evaluation, trying to answer a simple question: how can we measure whether translation performance degrades with document length? Our analyses, which evaluate encoder-decoder systems and a large language model against several metrics on a scientific document translation task, suggest that translating long documents in a single pass remains a difficult problem.

Vers la traduction automatique des néologismes scientifiques (Towards the machine translation of scientific neologisms)
Paul Lerner | François Yvon
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

Scientific research continually discovers and invents new concepts, which are then designated by new terms: neologisms, or neonyms in this context. Since publications are overwhelmingly written in English, disseminating this new knowledge in French often requires translating these terms, so as to avoid a proliferation of anglicisms that are less easily understood by the general public. We propose to explore this task using two thesauri, exploiting the definition of the term in order to translate it more faithfully. To this end, we explore the capabilities of two large multilingual language models, BLOOM and CroissantLLM, which manage to translate scientific neologisms to some extent. We show in particular that they often use appropriate morphosyntactic processes but are limited by subword segmentation and biased by the frequency of occurrence of terms as well as by surface similarities between English and French.

MaskLID: Code-Switching Language Identification through Iterative Masking
Amir Hossein Kargaran | François Yvon | Hinrich Schuetze
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), which are both based on the FastText architecture. Code and demo are available at https://github.com/cisnlp/MaskLID.
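
The predict, mask, and re-predict loop described above can be sketched with a toy classifier. This is only an illustration of the general idea, not the MaskLID code: the lexicon `WORD_SCORES`, the scorer `toy_lid` and the function `mask_lid` are hypothetical stand-ins for a FastText-based LID and its feature scores.

```python
from collections import defaultdict

# Hypothetical toy lexicon: per-word scores for each language label.
WORD_SCORES = {
    "the": {"eng": 2.0}, "weather": {"eng": 2.0}, "is": {"eng": 1.5},
    "muy": {"spa": 2.0}, "bonito": {"spa": 2.0}, "hoy": {"spa": 2.0},
}


def toy_lid(words):
    """Return the label with the highest summed word score, or None if nothing scores."""
    totals = defaultdict(float)
    for w in words:
        for lang, score in WORD_SCORES.get(w, {}).items():
            totals[lang] += score
    return max(totals, key=totals.get) if totals else None


def mask_lid(text, max_rounds=3):
    """Collect labels by repeatedly predicting, then masking words tied to that label."""
    words = text.lower().split()
    labels = []
    for _ in range(max_rounds):
        label = toy_lid(words)
        if label is None:
            break
        labels.append(label)
        # Masking step: drop every word that contributed to the predicted label.
        words = [w for w in words if label not in WORD_SCORES.get(w, {})]
    return labels


print(mask_lid("the weather is muy bonito"))
```

In this toy setting a monolingual input yields a single label, since nothing scorable is left after the first masking round.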

Invited Talk: The Way Towards Massively Multilingual Language Models
François Yvon
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

Translate your Own: a Post-Editing Experiment in the NLP domain
Rachel Bawden | Ziqian Peng | Maud Bénard | Éric Clergerie | Raphaël Esamotunu | Mathilde Huguin | Natalie Kübler | Alexandra Mestivier | Mona Michelot | Laurent Romary | Lichao Zhu | François Yvon
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

The improvements in neural machine translation make translation and post-editing pipelines ever more effective for a wider range of applications. In this paper, we evaluate the effectiveness of such a pipeline for the translation of scientific documents (limited here to article abstracts). Using a dedicated interface, we collect, then analyse the post-edits of approximately 350 abstracts (English→French) in the Natural Language Processing domain for two groups of post-editors: domain experts (academics encouraged to post-edit their own articles) on the one hand and trained translators on the other. Our results confirm that such pipelines can be effective, at least for high-resource language pairs. They also highlight the difference in the post-editing strategy of the two subgroups. Finally, they suggest that working on term translation is the most pressing issue to improve fully automatic translations, but that in a post-editing setup, other error types can be equally annoying for post-editors.

2023

MaTOS: Traduction automatique pour la science ouverte (MaTOS: Machine translation for open science)
Maud Bénard | Alexandra Mestivier | Natalie Kubler | Lichao Zhu | Rachel Bawden | Eric De La Clergerie | Laurent Romary | Mathilde Huguin | Jean-François Nominé | Ziqian Peng | François Yvon
Actes de CORIA-TALN 2023. Actes de l'atelier "Analyse et Recherche de Textes Scientifiques" (ARTS)@TALN 2023

This contribution presents the MaTOS (Machine Translation for Open Science) project, which aims to develop new methods for the full machine translation (MT) of scientific documents between French and English, as well as automatic metrics to evaluate the quality of the translations produced. To this end, MaTOS is concerned with (a) the collection of open resources for specialized MT; (b) the description of textual coherence markers in scientific articles; (c) the development of new multilingual processing methods for documents; (d) metrics measuring progress in the translation of complete documents.

Production automatique de gloses interlinéaires à travers un modèle probabiliste exploitant des alignements (Automatic production of interlinear glosses with a probabilistic model exploiting alignments)
Shu Okabe | François Yvon
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

The production of linguistic annotations, or interlinear glosses, which make explicit the meaning or function of each unit identified in a source recording (or in its transcription), is an important step in the language documentation process. These glosses require very high expertise in the documented language and tedious annotation work. Our study addresses the partial automation of this process. It relies on partitioning glosses into two types: grammatical glosses, which express a grammatical function, and lexical glosses, which indicate units of meaning. Our approach rests on the hypothesis of an alignment between lexical glosses and a translation, together with the use of Lost, a probabilistic machine translation model. Our experiments on a language currently being documented, Tsez, show that this learning is effective even with a small number of supervised sentences.

Joint Word and Morpheme Segmentation with Bayesian Non-Parametric Models
Shu Okabe | François Yvon
Findings of the Association for Computational Linguistics: EACL 2023

Language documentation often requires segmenting transcriptions of utterances collected in the field into words and morphemes. While these two tasks are typically performed in succession, we study here Bayesian models for simultaneously segmenting utterances at these two levels. Our aim is twofold: (a) to study the effect of explicitly introducing a hierarchy of units in joint segmentation models; (b) to further assess whether these two levels can be better identified through weak supervision. For this, we first consider a deterministic coupling between independent models; then design and evaluate hierarchical Bayesian models. Experiments with two under-resourced languages (Japhug and Tsez) allow us to better understand the value of various types of weak supervision. In our analysis, we use these results to revisit the distributional hypotheses behind Bayesian segmentation models and evaluate their validity for language documentation data.

BiSync: A Bilingual Editor for Synchronized Monolingual Texts
Josep Crego | Jitao Xu | François Yvon
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

In our globalized world, a growing number of situations arise where people are required to communicate in one or several foreign languages. In the case of written communication, users with a good command of a foreign language may find assistance from computer-aided translation (CAT) technologies. These technologies often allow users to access external resources, such as dictionaries, terminologies or bilingual concordancers, thereby interrupting and considerably hindering the writing process. In addition, CAT systems assume that the source sentence is fixed and also restrict the possible changes on the target side. In order to make the writing process smoother, we present BiSync, a bilingual writing assistant that allows users to freely compose text in two languages, while maintaining the two monolingual texts synchronized. We also include additional functionalities, such as the display of alternative prefix translations and paraphrases, which are intended to facilitate the authoring of texts. We detail the model architecture used for synchronization and evaluate the resulting tool, showing that high accuracy can be attained with limited computational resources. The interface and models are publicly available at https://github.com/jmcrego/BiSync and a demonstration video can be watched on YouTube https://youtu.be/_l-ugDHfNgU.

Towards Multilingual Interlinear Morphological Glossing
Shu Okabe | François Yvon
Findings of the Association for Computational Linguistics: EMNLP 2023

Interlinear Morphological Glosses are annotations produced in the context of language documentation. Their goal is to identify morphs occurring in an L1 sentence and to make their function and meaning explicit, with the further support of an associated translation in L2. We study here the task of automatic glossing, aiming to provide linguists with adequate tools to facilitate this process. Our formalisation of glossing uses a latent variable Conditional Random Field (CRF), which labels the L1 morphs while simultaneously aligning them to L2 words. In experiments with several under-resourced languages, we show that this approach is both effective and data-efficient and mitigates the problem of annotating unknown morphs. We also discuss various design choices regarding the alignment process and the selection of features. We finally demonstrate that it can benefit from multilingual (pre-)training, achieving results which outperform very strong baselines.

Assessing Word Importance Using Models Trained for Semantic Tasks
Dávid Javorský | Ondřej Bojar | François Yvon
Findings of the Association for Computational Linguistics: ACL 2023

Many NLP tasks require automatically identifying the most significant words in a text. In this work, we derive word significance from models trained to solve semantic tasks: Natural Language Inference and Paraphrase Identification. Using an attribution method aimed at explaining the predictions of these models, we derive importance scores for each input token. We evaluate their relevance using a so-called cross-task evaluation: analyzing the performance of one model on an input masked according to the other model’s weight, we show that our method is robust with respect to the choice of the initial task. Additionally, we investigate the scores from the syntax point of view and observe interesting patterns, e.g. words closer to the root of a syntactic tree receive higher importance scores. Altogether, these observations suggest that our method can be used to identify important words in sentences without any explicit word importance labeling in training.

LISN @ SIGMORPHON 2023 Shared Task on Interlinear Glossing
Shu Okabe | François Yvon
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes LISN’s submission to the second track (open track) of the shared task on Interlinear Glossing for SIGMORPHON 2023. Our systems are based on Lost, a variation of linear Conditional Random Fields initially developed as a probabilistic translation model and then adapted to the glossing task. This model allows us to handle one of the main challenges posed by glossing, i.e. the fact that the list of potential labels for lexical morphemes is not fixed in advance and needs to be extended dynamically when labelling units are not seen in training. In such situations, we show how to make use of candidate lexical glosses found in the translation and discuss how such extension affects the training and inference procedures. The resulting automatic glossing systems prove to yield very competitive results, especially in low-resource settings.

Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM
Rachel Bawden | François Yvon
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM’s multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from overgeneration and generating in the wrong language, but this is greatly improved in the few-shot setting, with very good results for a number of language pairs. We study several aspects including prompt design, model sizes, cross-lingual transfer and the use of discursive context.

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Ayyoob Imani | Peiqin Lin | Amir Hossein Kargaran | Silvia Severini | Masoud Jalili Sabet | Nora Kassner | Chunlan Ma | Helmut Schmid | André Martins | François Yvon | Hinrich Schütze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, “help” from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

GlotLID: Language Identification for Low-Resource Languages
Amir Hossein Kargaran | Ayyoob Imani | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EMNLP 2023

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable, and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.

Towards Example-Based NMT with Multi-Levenshtein Transformers
Maxime Bouthors | Josep Crego | François Yvon
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Retrieval-Augmented Machine Translation (RAMT) is attracting growing attention. This is because RAMT not only improves translation metrics, but is also assumed to implement some form of domain adaptation. In this contribution, we study another salient trait of RAMT, its ability to make translation decisions more transparent by allowing users to go back to examples that contributed to these decisions. For this, we propose a novel architecture aiming to increase this transparency. This model adapts a retrieval-augmented version of the Levenshtein Transformer and makes it amenable to simultaneously edit multiple fuzzy matches found in memory. We discuss how to perform training and inference in this model, based on multi-way alignment algorithms and imitation learning. Our experiments show that editing several examples positively impacts translation scores, notably increasing the number of target spans that are copied from existing instances.

Integrating Translation Memories into Non-Autoregressive Machine Translation
Jitao Xu | Josep Crego | François Yvon
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Non-autoregressive machine translation (NAT) has recently made great progress. However, most works to date have focused on standard translation tasks, even though some edit-based NAT models, such as the Levenshtein Transformer (LevT), seem well suited to translate with a Translation Memory (TM). This is the scenario considered here. We first analyze the vanilla LevT model and explain why it does not do well in this setting. We then propose a new variant, TM-LevT, and show how to effectively train this model. By modifying the data presentation and introducing an extra deletion operation, we obtain performance that is on par with an autoregressive approach, while reducing the decoding load. We also show that incorporating TMs during training dispenses with the need for knowledge distillation, a well-known trick used to mitigate the multimodality issue.

Structural generalization in COGS: Supertagging is (almost) all you need
Alban Petit | Caio Corro | François Yvon
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples. In particular, several recent semantic parsing datasets have put forward important limitations of neural networks in cases where compositional generalization is required. In this work, we extend a neural graph-based parsing framework in several ways to alleviate this issue, notably: (1) the introduction of a supertagging step with valency constraints, expressed as an integer linear program; (2) the reduction of the graph prediction problem to the maximum matching problem; (3) the design of an incremental early-stopping training strategy to prevent overfitting. Experimentally, our approach significantly improves results on examples that require structural generalization in the COGS dataset, a known challenging benchmark for compositional generalization. Overall, these results confirm that structural constraints are important for generalization in semantic parsing.

2022

Graph Neural Networks for Multiparallel Word Alignment
Ayyoob Imani | Lütfi Kerem Senel | Masoud Jalili Sabet | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: ACL 2022

After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pairs by considering all language pairs together. First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph. Next, we use graph neural networks (GNNs) to exploit the graph structure. Our GNN approach (i) utilizes information about the meaning, position and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) adds and removes edges from the initial alignments, and (iv) yields a prediction model that can generalize beyond the training sentences. We show that community detection algorithms can provide valuable information for multiparallel word alignment. Our method outperforms previous work on three word alignment datasets and on a downstream task.

Flux d’informations dans les systèmes encodeur-décodeur. Application à l’explication des biais de genre dans les systèmes de traduction automatique. (Information flow in encoder-decoder systems applied to the explanation of gender bias in machine translation systems)
Lichao Zhu | Guillaume Wisniewski | Nicolas Ballier | François Yvon
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Atelier TAL et Humanités Numériques (TAL-HN)

This work presents two series of experiments aimed at identifying information flows in neural translation systems. The first series relies on a comparison of the decisions of a language model and a translation model to highlight the information flow coming from the source. The second series highlights the impact of these flows on the training of the system in the particular case of the transfer of gender information.

Biais de genre dans un système de traduction automatique neuronale : une étude des mécanismes de transfert cross-langue [Gender bias in a neural machine translation system: a study of crosslingual transfer mechanisms]
Guillaume Wisniewski | Lichao Zhu | Nicolas Ballier | François Yvon
Traitement Automatique des Langues, Volume 63, Numéro 1 : Varia [Varia]

Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging
Ayyoob Imani | Silvia Severini | Masoud Jalili Sabet | François Yvon | Hinrich Schütze
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.

Sous-titrage automatique : étude de stratégies d’adaptation aux genres télévisuels [Automatic closed captioning: a study of strategies for televisual genre adaptation]
François Buet | François Yvon
Traitement Automatique des Langues, Volume 63, Numéro 1 : Varia [Varia]

Latent Group Dropout for Multilingual and Multidomain Machine Translation
Minh-Quang Pham | François Yvon | Josep Crego
Findings of the Association for Computational Linguistics: NAACL 2022

Multidomain and multilingual machine translation often rely on parameter sharing strategies, where large portions of the network are meant to capture the commonalities of the tasks at hand, while smaller parts are reserved to model the peculiarities of a language or a domain. In adapter-based approaches, these strategies are hardcoded in the network architecture, independent of the similarities between tasks. In this work, we propose a new method to better take advantage of these similarities, using a latent-variable model. We also develop new techniques to train this model end-to-end and report experimental results showing that the learned patterns are both meaningful and yield improved translation performance without any increase of the model size.

Joint Generation of Captions and Subtitles with Dual Decoding
Jitao Xu | François Buet | Josep Crego | Elise Bertin-Lemée | François Yvon
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

As the amount of audio-visual content increases, the need to develop automatic captioning and subtitling solutions to match the expectations of a growing international audience appears as the only viable way to boost throughput and lower the related post-production costs. Automatic captioning and subtitling often need to be tightly intertwined to achieve an appropriate level of consistency and synchronization with each other and with the video signal. In this work, we assess a dual decoding scheme to achieve a strong coupling between these two tasks and show how adequacy and consistency are increased, with virtually no additional cost in terms of model size and training complexity.

Bilingual Synchronization: Restoring Translational Relationships with Editing Operations
Jitao Xu | Josep Crego | François Yvon
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Machine Translation (MT) is usually viewed as a one-shot process that generates the target language equivalent of some source text from scratch. We consider here a more general setting which assumes an initial target sequence, that must be transformed into a valid translation of the source, thereby restoring parallelism between source and target. For this bilingual synchronization task, we consider several architectures (both autoregressive and non-autoregressive) and training regimes, and experiment with multiple practical settings such as simulated interactive MT, translating with Translation Memory (TM) and TM cleaning. Our results suggest that one single generic edit-based system, once fine-tuned, can compare with, or even outperform, dedicated systems specifically trained for these tasks.

Modèle-s bayés-ien-s pour la segment-ation à deux niveau-x faible-ment super-vis-é-e (Bayesian models for weakly supervised two-level segmentation)
Shu Okabe | François Yvon
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Automatic segmentation into words and morphemes is a crucial step in the language documentation process. In this work, we study several Bayesian models for jointly segmenting sentences at these two levels: on the one hand, by introducing a deterministic coupling between two specialized models to identify each type of boundary; on the other hand, by proposing an intrinsically hierarchical model. An important objective of this study is to compare these models in a scenario where weak supervision is available. Our experiments cover two languages and allow us to compare the merits of these various models under realistic conditions.

Evaluating Subtitle Segmentation for End-to-end Generation Systems
Alina Karakanta | François Buet | Mauro Cettolo | François Yvon
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Subtitles appear on screen as short pieces of text, segmented based on formal constraints (length) and syntactic/semantic criteria. Subtitle segmentation can be evaluated with sequence segmentation metrics against a human reference. However, standard segmentation metrics cannot be applied when systems generate outputs different than the reference, e.g. with end-to-end subtitling systems. In this paper, we study ways to conduct reference-based evaluations of segmentation accuracy irrespective of the textual content. We first conduct a systematic analysis of existing metrics for evaluating subtitle segmentation. We then introduce Sigma, a Subtitle Segmentation Score derived from an approximate upper-bound of BLEU on segmentation boundaries, which allows us to disentangle the effect of good segmentation from text quality. To compare Sigma with existing metrics, we further propose a boundary projection method from imperfect hypotheses to the true reference. Results show that all metrics are able to reward high quality output but for similar outputs system ranking depends on each metric’s sensitivity to error type. Our thorough analyses suggest Sigma is a promising segmentation candidate but its reliability over other segmentation metrics remains to be validated through correlations with human judgements.

Analyzing Gender Translation Errors to Identify Information Flows between the Encoder and Decoder of a NMT System
Guillaume Wisniewski | Lichao Zhu | Nicolas Ballier | François Yvon
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Multiple studies have shown that existing NMT systems demonstrate some kind of “gender bias”. As a result, MT output appears to err more often for feminine forms and to amplify social gender misrepresentations, which is potentially harmful to users and practitioners of these technologies. This paper continues this line of investigation and reports results obtained with a new test set in strictly controlled conditions. This setting allows us to better understand the multiple inner mechanisms that are causing these biases, which include the linguistic expressions of gender, the unbalanced distribution of masculine and feminine forms in the language, the modelling of morphological variation and the training process dynamics. To counterbalance these effects, we formulate several proposals and notably show that modifying the training loss can effectively mitigate such biases.

Multi-Domain Adaptation in Neural Machine Translation with Dynamic Sampling Strategies
Minh-Quang Pham | Josep Crego | François Yvon
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

Building effective Neural Machine Translation models often implies accommodating diverse sets of heterogeneous data so as to optimize performance for the domain(s) of interest. Such multi-source / multi-domain adaptation problems are typically approached through instance selection or reweighting strategies, based on a static assessment of the relevance of training instances with respect to the task at hand. In this paper, we study dynamic data selection strategies that are able to automatically re-evaluate the usefulness of data samples and to evolve a data selection policy in the course of training. Based on the results of multiple experiments, we show that such methods constitute a generic framework to automatically and effectively handle a variety of real-world situations, from multi-source domain adaptation to multi-domain learning and unsupervised domain adaptation.

Ré-ordonnancement via programmation dynamique pour l’adaptation cross-lingue d’un analyseur en dépendances (Sentence reordering via dynamic programming for cross-lingual dependency parsing)
Nicolas Devatine | Caio Corro | François Yvon
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

This paper addresses the cross-lingual transfer of dependency parsers and studies methods to limit the potentially harmful effect on transfer of word-order divergences between the source and target languages. We show how to learn and implement reordering strategies which, when used as a preprocessing step, often improve parser performance in a “zero-shot” transfer scenario.

Weakly Supervised Word Segmentation for Computational Language Documentation
Shu Okabe | Laurent Besacier | François Yvon
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word and morpheme segmentation are fundamental steps of language documentation as they make it possible to discover lexical units in a language for which the lexicon is unknown. However, in most language documentation scenarios, linguists do not start from a blank page: they may already have a pre-existing dictionary or have initiated manual segmentation of a small part of their data. This paper studies how such weak supervision can be taken advantage of in Bayesian non-parametric models of segmentation. Our experiments on two very low resource languages (Mboshi and Japhug), whose documentation is still in progress, show that weak supervision can be beneficial to the segmentation quality. In addition, we investigate an incremental learning scenario where manual segmentations are provided in a sequential manner. This work opens the way for interactive annotation tools for documentary linguists.

2021

LISN @ WMT 2021
Jitao Xu | Minh Quang Pham | Sadaf Abdul Rauf | François Yvon
Proceedings of the Sixth Conference on Machine Translation

This paper describes LISN’s submissions to two shared tasks at WMT’21. For the biomedical translation task, we have developed resource-heavy systems for the English-French language pair, using both out-of-domain and in-domain corpora. The target genre for this task (scientific abstracts) corresponds to texts that often have a standardized structure. Our systems attempt to take this structure into account using a hierarchical system of sentence-level tags. Translation systems were also prepared for the News task for the French-German language pair. The challenge was to perform unsupervised adaptation to the target domain (financial news). For this, we explored the potential of retrieval-based strategies, where sentences that are similar to test instances are used to prime the decoder.

Optimizing Word Alignments with Better Subword Tokenization
Anh Khoa Ngo Ho | François Yvon
Proceedings of Machine Translation Summit XVIII: Research Track

Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for example, to train statistical machine translation systems, to learn bilingual dictionaries, or to perform quality estimation. Subword tokenization has become a standard preprocessing step for a large number of applications, notably for state-of-the-art open vocabulary machine translation systems. In this paper, we thoroughly study how this preprocessing step interacts with the word alignment task and propose several tokenization strategies to obtain well-segmented parallel corpora. Using these new techniques, we were able to improve baseline word-based alignment models for six language pairs.

Graph Algorithms for Multiparallel Word Alignment
Ayyoob Imani | Masoud Jalili Sabet | Lutfi Kerem Senel | Philipp Dufter | François Yvon | Hinrich Schütze
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in F1 of up to 28% over the baseline bilingual word aligner in different datasets.

Vers la production automatique de sous-titres adaptés à l’affichage (Towards automatic adapted monolingual captioning)
François Buet | François Yvon
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

One way to perform automatic monolingual subtitling is to combine a speech recognition system with a model that translates the transcript into subtitles. This “translation” task is delicate in that it must simplify and compress the text and comply with display-related norms, all while coping with speech recognition errors. A further difficulty is the relative scarcity of corpora that align automatic transcripts with subtitles. We describe here a new corpus that is currently being built, and we experiment with methods for controlling, more or less directly, the length of the generated sentences, in order to improve their quality from both a linguistic and a normative point of view.

One Source, Two Targets: Challenges and Rewards of Dual Decoding
Jitao Xu | François Yvon
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Machine translation is generally understood as generating one target text from an input source document. In this paper, we consider a stronger requirement: to jointly generate two texts so that each output side effectively depends on the other. As we discuss, such a device serves several practical purposes, from multi-target machine translation to the generation of controlled variations of the target text. We present an analysis of possible implementations of dual decoding, and experiment with four applications. Viewing the problem from multiple angles allows us to better highlight the challenges of dual decoding and to also thoroughly analyze the benefits of generating matched, rather than independent, translations.

Can You Traducir This? Machine Translation for Code-Switched Input
Jitao Xu | François Yvon
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-Switching (CSW) is a common phenomenon that occurs in multilingual geographic or social contexts, which raises challenging problems for natural language processing tools. We focus here on Machine Translation (MT) of CSW texts, where we aim to simultaneously disentangle and translate the two mixed languages. Due to the lack of actual translated CSW data, we generate artificial training data from regular parallel texts. Experiments show this training strategy yields MT systems that surpass multilingual systems for code-switched texts. These results are confirmed in an alternative task aimed at providing contextual translations for an L2 writing assistant.

Biais de genre dans un système de traduction automatique neuronale : une étude préliminaire (Gender Bias in Neural Translation: a preliminary study)
Guillaume Wisniewski | Lichao Zhu | Nicolas Ballier | François Yvon
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

This paper presents the first results of an ongoing study on gender bias in training corpora and in neural translation systems. In particular, we study a minimal, controlled corpus to measure the intensity of these biases in both the English-French and French-English directions; this controlled setting also allows us to analyze the internal representations that the system manipulates to make its lexical predictions, and to formulate hypotheses about how this bias is distributed across the system’s representations.

Revisiting Multi-Domain Machine Translation
MinhQuang Pham | Josep Maria Crego | François Yvon
Transactions of the Association for Computational Linguistics, Volume 9

When building machine translation systems, one often needs to make the best out of heterogeneous sets of parallel data in training, and to robustly handle inputs from unexpected domains in testing. This multi-domain scenario has attracted a lot of recent work that falls under the general umbrella of transfer learning. In this study, we revisit multi-domain machine translation, with the aim to formulate the motivations for developing such systems and the associated expectations with respect to performance. Our experiments with a large sample of multi-domain systems show that most of these expectations are hardly met and suggest that further work is needed to better analyze the current behaviour of multi-domain systems and to make them fully hold their promises.

Screening Gender Transfer in Neural Machine Translation
Guillaume Wisniewski | Lichao Zhu | Nicolas Ballier | François Yvon
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

This paper aims at identifying the information flow in state-of-the-art machine translation systems, taking as example the transfer of gender when translating from French into English. Using a controlled set of examples, we experiment with several ways to investigate how gender information circulates in an encoder-decoder architecture, considering both probing techniques and interventions on the internal representations used in the MT system. Our results show that gender information can be found in all token representations built by the encoder and the decoder and lead us to conclude that there are multiple pathways for gender transfer.

2020

SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings
Masoud Jalili Sabet | Philipp Dufter | François Yvon | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings – both static and contextualized – for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners – even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.
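As an illustration of aligning words from embedding similarity alone, the sketch below links a source word and a target word when each is the other's nearest neighbour under cosine similarity. This mutual-argmax heuristic is an assumption chosen for simplicity; the abstract does not specify the extraction procedure, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_argmax_alignment(
    src_vecs: List[Sequence[float]], tgt_vecs: List[Sequence[float]]
) -> Set[Tuple[int, int]]:
    """Return (source index, target index) pairs that are mutual best matches."""
    if not src_vecs or not tgt_vecs:
        return set()
    # Similarity matrix: one row per source word, one column per target word.
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for each source word (row-wise argmax).
    best_tgt = [max(range(len(row)), key=row.__getitem__) for row in sim]
    # Best source for each target word (column-wise argmax).
    best_src = [
        max(range(len(sim)), key=lambda i: sim[i][j]) for j in range(len(tgt_vecs))
    ]
    return {(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i}
```

The embeddings themselves would come from a multilingual model; here they are simply passed in as lists of floats.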

Priming Neural Machine Translation
Minh Quang Pham | Jitao Xu | Josep Crego | François Yvon | Jean Senellart
Proceedings of the Fifth Conference on Machine Translation

Priming is a well known and studied psychology phenomenon based on the prior presentation of one stimulus (cue) to influence the processing of a response. In this paper, we propose a framework to mimic the process of priming in the context of neural machine translation (NMT). We evaluate the effect of using similar translations as priming cues on the NMT network. We propose a method to inject priming cues into the NMT network and compare our framework to other mechanisms that perform micro-adaptation during inference. Overall, experiments conducted in a multi-domain setting confirm that adding priming cues in the NMT decoder can go a long way towards improving the translation accuracy. Besides, we show the suitability of our framework to gather valuable information for an NMT network from monolingual resources.

Simplification automatique de texte dans un contexte de faibles ressources (Automatic Text Simplification : Approaching the Problem in Low Resource Settings for French)
Sadaf Abdul Rauf | Anne-Laure Ligozat | Francois Yvon | Gabriel Illouz | Thierry Hamon
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Text simplification has emerged as an active subfield of natural language processing, owing to the practical and theoretical problems it makes it possible to address, as well as to its many practical applications. Simplification corpora are needed to train automatic simplification systems; these resources are, however, scarce and exist only for a small number of languages. We show here that in a setting where simplification resources are scarce, it nevertheless remains possible to build simplification systems by resorting to synthetic corpora, obtained for example through machine translation, and we evaluate various ways of building them.

Generative latent neural models for automatic word alignment
Anh Khoa Ngo Ho | François Yvon
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

A Study of Residual Adapters for Multi-Domain Neural Machine Translation
Minh Quang Pham | Josep Maria Crego | François Yvon | Jean Senellart
Proceedings of the Fifth Conference on Machine Translation

Domain adaptation is an old and vexing problem for machine translation systems. The most common and successful approach to supervised adaptation is to fine-tune a baseline system with in-domain parallel data. Standard fine-tuning, however, modifies all the network parameters, which makes this approach computationally costly and prone to overfitting. A recent, lightweight approach instead augments a baseline model with supplementary (small) adapter layers, keeping the rest of the model unchanged. This has the additional merit of leaving the baseline model intact and adaptable to multiple domains. In this paper, we conduct a thorough analysis of the adapter model in the context of a multi-domain machine translation task. We contrast multiple implementations of this idea on two language pairs. Our main conclusions are that residual adapters provide a fast and cheap method for supervised multi-domain adaptation; our two variants prove as effective as the original adapter model, and open perspectives for also making adapted models more robust to domain label errors.

LIMSI @ WMT 2020
Sadaf Abdul Rauf | José Carlos Rosales Núñez | Minh Quang Pham | François Yvon
Proceedings of the Fifth Conference on Machine Translation

This paper describes LIMSI’s submissions to the translation shared tasks at WMT’20. This year we have focused our efforts on the biomedical translation task, developing a resource-heavy system for the translation of medical abstracts from English into French, using back-translated texts, terminological resources as well as multiple pre-processing pipelines, including pre-trained representations. Systems were also prepared for the robustness task for translating from English into German; for this large-scale task we developed multi-domain, noise-robust translation systems that aim to handle the two test conditions: zero-shot and few-shot domain adaptation.

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

Proceedings of the 17th International Conference on Spoken Language Translation
Marcello Federico | Alex Waibel | Kevin Knight | Satoshi Nakamura | Hermann Ney | Jan Niehues | Sebastian Stüker | Dekai Wu | Joseph Mariani | Francois Yvon
Proceedings of the 17th International Conference on Spoken Language Translation

2019

Neural Baselines for Word Alignment
Anh Khoa Ngo Ho | François Yvon
Proceedings of the 16th International Conference on Spoken Language Translation

Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for instance, to learn bilingual dictionaries, to train statistical machine translation systems, or to perform quality estimation. In most areas of natural language processing, neural network models nowadays constitute the preferred approach, a situation that might also apply to word alignment models. In this work, we study and comprehensively evaluate neural models for unsupervised word alignment for four language pairs, contrasting several variants of neural models. We show that in most settings, neural versions of the IBM-1 and hidden Markov models vastly outperform their discrete counterparts. We also analyze typical alignment errors of the baselines that our models overcome to illustrate the benefits (and the limitations) of these new models for morphologically rich languages.

How Bad are PoS Tagger in Cross-Corpora Settings? Evaluating Annotation Divergence in the UD Project.
Guillaume Wisniewski | François Yvon
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The performance of Part-of-Speech tagging varies significantly across the treebanks of the Universal Dependencies project. This work points out that these variations may result from divergences between the annotation of train and test sets. We show how the annotation variation principle, introduced by Dickinson and Meurers (2003) to automatically detect errors in gold standard, can be used to identify inconsistencies between annotations; we also evaluate their impact on prediction performance.
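The variation principle can be illustrated with a minimal sketch: a word that appears in the same immediate context but with different tags is a candidate inconsistency. The trigram context window, the function name and the toy data below are assumptions made for illustration; the paper's actual detection procedure may differ.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Sentence = List[Tuple[str, str]]  # (word, tag) pairs
Context = Tuple[str, str, str]    # (previous word, word, next word)


def variation_trigrams(corpus: List[Sentence]) -> Dict[Context, Set[str]]:
    """Return every word-in-context that received more than one tag."""
    seen: Dict[Context, Set[str]] = defaultdict(set)
    for sent in corpus:
        # Pad with boundary symbols so that edge words also have a context.
        words = ["<s>"] + [w for w, _ in sent] + ["</s>"]
        for i, (_, tag) in enumerate(sent):
            seen[(words[i], words[i + 1], words[i + 2])].add(tag)
    return {ctx: tags for ctx, tags in seen.items() if len(tags) > 1}
```

Running this on the concatenation of a training set and a test set would surface contexts that the two annotate differently.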

Controlling Utterance Length in NMT-based Word Segmentation with Attention
Pierre Godard | Laurent Besacier | François Yvon
Proceedings of the 16th International Conference on Spoken Language Translation

One of the basic tasks of computational language documentation (CLD) is to identify word boundaries in an unsegmented phonemic stream. While several unsupervised monolingual word segmentation algorithms exist in the literature, they are challenged in real-world CLD settings by the small amount of available data. A possible remedy is to take advantage of glosses or translation in a foreign, well-resourced, language, which often exist for such data. In this paper, we explore and compare ways to exploit neural machine translation models to perform unsupervised boundary detection with bilingual information, notably introducing a new loss function for jointly learning alignment and segmentation. We experiment with an actual under-resourced language, Mboshi, and show that these techniques can effectively control the output segmentation length.

Generic and Specialized Word Embeddings for Multi-Domain Machine Translation
MinhQuang Pham | Josep Crego | François Yvon | Jean Senellart
Proceedings of the 16th International Conference on Spoken Language Translation

Supervised machine translation works well when the train and test data are sampled from the same distribution. When this is not the case, adaptation techniques help ensure that the knowledge learned from out-of-domain texts generalises to in-domain sentences. We study here a related setting, multi-domain adaptation, where the number of domains is potentially large and adapting separately to each domain would waste training resources. Our proposal transposes to neural machine translation the feature expansion technique of (Daumé III, 2007): it isolates domain-agnostic from domain-specific lexical representations, while sharing most of the network across domains. Our experiments use two architectures and two language pairs: they show that our approach, while simple and computationally inexpensive, outperforms several strong baselines and delivers a multi-domain system that successfully translates texts from diverse sources.
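For reference, the classical feature expansion of Daumé III (2007) can be sketched as follows: each feature vector is copied into a shared block and into the block reserved for its own domain, with all other domain blocks left at zero. This shows the original feature-level technique only; how the paper transposes it to word embeddings is not reproduced here, and the function name is hypothetical.

```python
from typing import List, Sequence


def expand_features(x: Sequence[float], domain: int, num_domains: int) -> List[float]:
    """Map x to [shared copy | block for domain 0 | ... | block for domain K-1].

    Only the shared block and the block of the given domain hold a copy of x;
    the blocks of all other domains are filled with zeros.
    """
    if not 0 <= domain < num_domains:
        raise ValueError("domain index out of range")
    d = len(x)
    out = list(x) + [0.0] * (d * num_domains)
    start = d * (1 + domain)
    out[start:start + d] = list(x)
    return out
```

With two domains, a vector from domain 0 becomes `[x, x, 0]` and one from domain 1 becomes `[x, 0, x]`, so a linear model can learn both shared and domain-specific weights.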

Measuring text readability with machine comprehension: a pilot study
Marc Benzahra | François Yvon
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

This article studies the relationship between text readability indices and automatic machine understanding systems. Our hypothesis is that the simpler a text is, the better it should be understood by a machine. We thus expect a strong correlation between readability levels on the one hand, and performance of automatic reading systems on the other hand. We test this hypothesis with several understanding systems based on language models of varying strengths, measuring this correlation on two corpora of journalistic texts. Our results suggest that this correlation is rather small and that existing comprehension systems are far from reproducing the gradual improvement of their performance on texts of decreasing complexity.

2018

Fixing Translation Divergences in Parallel Corpora for Neural MT
MinhQuang Pham | Josep Crego | Jean Senellart | François Yvon
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.

Quantifying training challenges of dependency parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 27th International Conference on Computational Linguistics

Not all dependencies are equal when training a dependency parser: some are straightforward enough to be learned with only a sample of data, others embed more complexity. This work introduces a series of metrics to quantify those differences, and thereby to expose the shortcomings of various parsing algorithms and strategies. Apart from a more thorough comparison of parsing systems, these new tools also prove useful for characterizing the information conveyed by cross-lingual parsers, in a quantitative but still interpretable way.

Les méthodes « apprendre à chercher » en traitement automatique des langues : un état de l’art [A survey of learning-to-search techniques in Natural Language Processing]
Elena Knyazeva | Guillaume Wisniewski | François Yvon
Traitement Automatique des Langues, Volume 59, Numéro 1 : Varia [Varia]

Exploiting Dynamic Oracles to Train Projective Dependency Parsers on Non-Projective Trees
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Because the most common transition systems are projective, training a transition-based dependency parser often implies either ignoring or rewriting the non-projective training examples, which has an adverse impact on accuracy. In this work, we propose a simple modification of dynamic oracles, which enables the use of non-projective data when training projective parsers. Evaluation on 73 treebanks shows that our method achieves significant gains (+2 to +7 UAS for the most non-projective languages) and consistently outperforms traditional projectivization and pseudo-projectivization approaches.

Automatically Selecting the Best Dependency Annotation Design with Dynamic Oracles
Guillaume Wisniewski | Ophélie Lacroix | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This work introduces a new strategy to compare the numerous conventions that have been proposed over the years for expressing dependency structures and discover the one for which a parser will achieve the highest parsing performance. Instead of associating each sentence in the training set with a single gold reference we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select among all alternatives the reference that will be predicted with the highest accuracy. Experiments on the UD corpora show the validity of this approach.

A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Pierre Godard | Gilles Adda | Martine Adda-Decker | Juan Benjumea | Laurent Besacier | Jamison Cooper-Leavitt | Guy-Noel Kouarata | Lori Lamel | Hélène Maynard | Markus Mueller | Annie Rialland | Sebastian Stueker | François Yvon | Marcely Zanon-Boito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Divergences entre annotations dans le projet Universal Dependencies et leur impact sur l’évaluation des performance d’étiquetage morpho-syntaxique (Evaluating Annotation Divergences in the UD Project)
Guillaume Wisniewski | François Yvon
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

This work shows that the performance degradation often observed when a part-of-speech tagger is applied to out-of-domain data frequently results from inconsistencies between the annotations of the test and training sets. We show how the annotation variation principle, introduced by Dickinson & Meurers (2003) to automatically identify annotation errors, can be used to identify these inconsistencies and to evaluate their impact on the performance of part-of-speech taggers.

Évaluation morphologique pour la traduction automatique : adaptation au français (Morphological Evaluation for Machine Translation : Adaptation to French)
Franck Burlot | François Yvon
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

The new state of the art in machine translation (MT) relies on neural methods, which differ profoundly from the methods used previously. Standard automatic metrics are ill-suited to capturing the nature of the qualitative leap that has been observed. This paper proposes an evaluation protocol for English-to-French translation that focuses specifically on the morphological competence of MT systems, by studying their performance on various grammatical phenomena.

Using Monolingual Data in Neural Machine Translation: a Systematic Study
Franck Burlot | François Yvon
Proceedings of the Third Conference on Machine Translation: Research Papers

Neural Machine Translation (MT) has radically changed the way systems are developed. A major difference with the previous generation (Phrase-Based MT) is the way monolingual target data, which often abounds, is used in these two paradigms. While Phrase-Based MT can seamlessly integrate very large language models trained on billions of sentences, the best option for Neural MT developers seems to be the generation of artificial parallel data through back-translation - a technique that fails to fully take advantage of existing datasets. In this paper, we conduct a systematic study of back-translation, comparing alternative uses of monolingual data, as well as multiple data generation procedures. Our findings confirm that back-translation is very effective and give new explanations as to why this is the case. We also introduce new data simulation techniques that are almost as effective, yet much cheaper to implement.

Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages
Pierre Godard | Laurent Besacier | François Yvon | Martine Adda-Decker | Gilles Adda | Hélène Maynard | Annie Rialland
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages. The second consists in exploring the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment with 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language-specific knowledge leads to great improvements for the word discovery task, ultimately achieving a leap of about 30% token F-score from the results of a strong baseline.
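
The token F-score reported in this abstract is a standard word segmentation metric. The sketch below shows one common way to compute it for a single utterance, under the assumption that a predicted token counts as correct only when both of its boundaries match the gold segmentation. The function names and the example are hypothetical, and the paper's exact evaluation script may differ.

```python
def spans(words):
    """Convert a segmentation into a set of (start, end) character spans."""
    out, start = set(), 0
    for word in words:
        out.add((start, start + len(word)))
        start += len(word)
    return out


def token_f_score(gold, predicted):
    """Token-level F-score between two segmentations of the same string."""
    gold_spans, predicted_spans = spans(gold), spans(predicted)
    correct = len(gold_spans & predicted_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the second and third gold words are merged by the system.
gold = ["the", "dog", "barks"]
predicted = ["the", "dogbarks"]
score = token_f_score(gold, predicted)  # precision 1/2, recall 1/3
```

Corpus-level scores are usually obtained by summing the counts over all utterances before computing precision and recall, which this single-utterance sketch leaves out.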

The WMT’18 Morpheval test suites for English-Czech, English-German, English-Finnish and Turkish-English
Franck Burlot | Yves Scherrer | Vinit Ravishankar | Ondřej Bojar | Stig-Arne Grönroos | Maarit Koponen | Tommi Nieminen | François Yvon
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

Progress in the quality of machine translation output calls for new automatic evaluation procedures and metrics. In this paper, we extend the Morpheval protocol introduced by Burlot and Yvon (2017) for the English-to-Czech and English-to-Latvian translation directions to three additional language pairs, and report its use to analyze the results of WMT 2018’s participants for these language pairs. Considering additional, typologically varied source and target languages also enables us to draw some generalizations regarding this morphology-oriented evaluation procedure.

2017

Learning the Structure of Variable-Order CRFs: a finite-state perspective
Thomas Lavergne | François Yvon
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The computational complexity of linear-chain Conditional Random Fields (CRFs) makes it difficult to deal with very large label sets and long range dependencies. Such situations are not rare and arise when dealing with morphologically rich languages or joint labelling tasks. We extend here recent proposals to consider variable order CRFs. Using an effective finite-state representation of variable-length dependencies, we propose new ways to perform feature selection at large scale and report experimental results where we outperform strong baselines on a tagging task.

LIMSI@CoNLL’17: UD Shared Task
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes LIMSI’s submission to the CoNLL 2017 UD Shared Task, which is focused on small treebanks, and how to improve low-resourced parsing only by ad hoc combination of multiple views and resources. We present our approach for low-resourced parsing, together with a detailed analysis of the results for each test treebank. We also report extensive analysis experiments on model selection for the PUD treebanks, and on annotation consistency among UD treebanks.

Don’t Stop Me Now! Using Global Dynamic Oracles to Correct Training Biases of Transition-Based Dependency Parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper formalizes a sound extension of dynamic oracles to global training, in the frame of transition-based dependency parsers. By dispensing with the pre-computation of references, this extension widens the training strategies that can be entertained for such parsers; we show this by revisiting two standard training procedures, early-update and max-violation, to correct some of their search space sampling biases. Experimentally, on the SPMRL treebanks, this improvement increases the similarity between the train and test distributions and yields performance improvements up to 0.7 UAS, without any computation overhead.
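
For readers unfamiliar with the metric quoted in this abstract, UAS (unlabeled attachment score) is the percentage of tokens whose predicted head matches the gold head. A minimal sketch follows, assuming heads are given as one integer index per token with 0 for the root; the function name `uas` and the example are hypothetical.

```python
def uas(gold_heads, predicted_heads):
    """Unlabeled attachment score: percentage of tokens with the correct head."""
    if not gold_heads or len(gold_heads) != len(predicted_heads):
        raise ValueError("head sequences must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Hypothetical 4-token sentence: the last token is attached to the wrong head.
gold_heads = [2, 0, 2, 3]
predicted_heads = [2, 0, 2, 2]
result = uas(gold_heads, predicted_heads)
```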

Word Representations in Factored Neural Machine Translation
Franck Burlot | Mercedes García-Martínez | Loïc Barrault | Fethi Bougares | François Yvon
Proceedings of the Second Conference on Machine Translation

LIMSI@WMT’17
Franck Burlot | Pooyan Safari | Matthieu Labeau | Alexandre Allauzen | François Yvon
Proceedings of the Second Conference on Machine Translation

Normalisation automatique du vocabulaire source pour traduire depuis une langue à morphologie riche (Learning Morphological Normalization for Translation from Morphologically Rich Languages)
Franck Burlot | François Yvon
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

When translating from a morphologically rich language into English, source word forms carry grammatical markers that can be considered redundant with respect to English, causing formal variability that hurts the estimation of probabilistic models. A well-documented way to mitigate this problem is to remove irrelevant information from the source by normalizing it. This preprocessing is usually performed deterministically, using hand-written rules. Such normalization is inherently suboptimal and must be adapted to each language pair. In this paper, we present a simple method to automatically search for an optimal normalization of the source morphology with respect to the target language, and show that it can improve machine translation.

Adaptation au domaine pour l’analyse morpho-syntaxique (Domain Adaptation for PoS tagging)
Éléonor Bartenlian | Margot Lacour | Matthieu Labeau | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

This work seeks to understand why the performance of a PoS tagger drops sharply when it is used on out-of-domain data. Using a toy experiment, we show that this behavior may be due to lexicalized features being masked by non-lexicalized features. We propose several models that attempt to reduce this effect.

Evaluating the morphological competence of Machine Translation Systems
Franck Burlot | François Yvon
Proceedings of the Second Conference on Machine Translation

2016

Apprentissage discriminant de modèles neuronaux pour la traduction automatique [Discriminative training of continuous space translation models]
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Traitement Automatique des Langues, Volume 57, Numéro 1 : Varia [Varia]

TransRead: Designing a Bilingual Reading Experience with Machine Translation Technologies
François Yvon | Yong Xu | Marianna Apidianaki | Clément Pillias | Pierre Cubaud
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

Cross-lingual and Supervised Models for Morphosyntactic Annotation: a Comparison on Romanian
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Because of the small size of Romanian corpora, the performance of a PoS tagger or a dependency parser trained with the standard supervised methods falls far short of the performance achieved in most languages. We therefore apply state-of-the-art methods for cross-lingual transfer on Romanian tagging and parsing, from English and several Romance languages. We compare the performance with monolingual systems trained with sets of different sizes and establish that training on a few sentences in the target language yields better results than transferring from large datasets in other languages.

LIMSI’s Contribution to the WMT’16 Biomedical Translation Task
Julia Ive | Aurélien Max | François Yvon
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Apprentissage d’analyseur en dépendances cross-lingue par projection partielle de dépendances (Cross-lingual learning of dependency parsers from partially projected dependencies)
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

This paper presents a simple method for cross-lingual dependency transfer. We first show that a transition-based dependency parser can be learned from partially annotated data. We then propose to build large partially annotated datasets for several target languages by projecting dependencies through the most reliable alignment links. By training parsers for the target languages on these partial data, we show that this simple method achieves performance that rivals recent state-of-the-art methods at a lower computational cost.

Parallel Sentence Compression
Julia Ive | François Yvon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Sentence compression is a way to perform text simplification and is usually handled in a monolingual setting. In this paper, we study ways to extend sentence compression in a bilingual context, where the goal is to obtain parallel compressions of parallel sentences. This can be beneficial for a series of multilingual natural language processing (NLP) tasks. We compare two ways to take bilingual information into account when compressing parallel sentences. Their efficiency is contrasted on a parallel corpus of News articles.

Cross-lingual Dependency Transfer : What Matters? Assessing the Impact of Pre- and Post-processing
Ophélie Lacroix | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

LIMSI@IWSLT’16: MT Track
Franck Burlot | Matthieu Labeau | Elena Knyazeva | Thomas Lavergne | Alexandre Allauzen | François Yvon
Proceedings of the 13th International Conference on Spoken Language Translation

This paper describes LIMSI’s submission to the MT track of IWSLT 2016. We report results for translation from English into Czech. Our submission is an attempt to address the difficulties of translating into a morphologically rich language by paying special attention to the morphology generation on the target side. To this end, we propose two ways of improving the morphological fluency of the output: 1. by performing translation and inflection of the target language in two separate steps, and 2. by using a neural language model with character-based word representation. We finally present the combination of both methods used for our primary system submission.

Zero-resource Dependency Parsing: Boosting Delexicalized Cross-lingual Transfer with Linguistic Knowledge
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper studies cross-lingual transfer for dependency parsing, focusing on very low-resource settings where delexicalized transfer is the only fully automatic option. We show how to boost parsing performance by rewriting the source sentences so as to better match the linguistic regularities of the target language. We contrast a data-driven approach with an approach relying on linguistically motivated rules automatically extracted from the World Atlas of Language Structures. Our findings are backed up by experiments involving 40 languages. They show that both approaches greatly outperform the baseline, the knowledge-driven method yielding the best accuracies, with average improvements of +2.9 UAS, and up to +90 UAS (absolute) on some frequent PoS configurations.

Two-Step MT: Predicting Target Morphology
Franck Burlot | Elena Knyazeva | Thomas Lavergne | François Yvon
Proceedings of the 13th International Conference on Spoken Language Translation

This paper describes a two-step machine translation system that addresses the issue of translating into a morphologically rich language (English to Czech), by performing separately the translation and the generation of target morphology. The first step consists in translating from English into a normalized version of Czech, where some morphological information has been removed. The second step retrieves this information and re-inflects the normalized output, turning it into fully inflected Czech. We introduce different setups for the second step and evaluate the quality of their predictions over different MT systems trained on different amounts of parallel and monolingual data and report ways to adapt to different data sizes, which improves the translation in low-resource conditions, as well as when large training data is available.

Novel elicitation and annotation schemes for sentential and sub-sentential alignments of bitexts
Yong Xu | François Yvon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Resources for evaluating sentence-level and word-level alignment algorithms are unsatisfactory. Regarding sentence alignments, the existing data is too scarce, especially when it comes to difficult bitexts, containing instances of non-literal translations. Regarding word-level alignments, most available hand-aligned data provide a complete annotation at the level of words that is difficult to exploit, for lack of a clear semantics for alignment links. In this study, we propose new methodologies for collecting human judgements on alignment links, which have been used to annotate 4 new data sets, at the sentence and at the word level. These will be released online, with the hope that they will prove useful to evaluate alignment software and quality estimation tools for automatic alignment. Keywords: Parallel corpora, Sentence Alignments, Word Alignments, Confidence Estimation

Frustratingly Easy Cross-Lingual Transfer for Transition-Based Dependency Parsing
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

LIMSI@WMT’16: Machine Translation of News
Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Ophélie Lacroix | Elena Knyazeva | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Ne nous arrêtons pas en si bon chemin : améliorations de l’apprentissage global d’analyseurs en dépendances par transition (Don’t Stop Me Now ! Improved Update Strategies for Global Training of Transition-Based)
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

In this paper, we propose three simple improvements for the global training of ArcEager transition-based dependency parsers: a non-deterministic oracle, resuming on the same example after an update, and training in suboptimal configurations. Their combination yields an average gain of 0.2 UAS on the SPMRL corpus. We also introduce a general framework allowing the systematic comparison of these strategies and of most known variants. We show that the literature has studied only a few strategies among the many possible variations, thereby neglecting several avenues for potential improvement.

Cross-lingual alignment transfer: a chicken-and-egg story?
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

Lecture bilingue augmentée par des alignements multi-niveaux (Augmenting bilingual reading with alignment information)
François Yvon | Yong Xu | Marianna Apidianaki | Clément Pillias | Pierre Cubaud
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 5 : Démonstrations

The work that led to this demonstration combines multilingual language processing tools, in particular automatic alignment, with visualization and interaction techniques. It aims to suggest directions for developing tools that allow readers to simultaneously read the different versions of a text available in several languages, with applications to leisure or professional reading.

2015

Why Predicting Post-Edition is so Hard? Failure Analysis of LIMSI Submission to the APE Shared Task
Guillaume Wisniewski | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

Apprentissage par imitation pour l’étiquetage de séquences : vers une formalisation des méthodes d’étiquetage easy-first
Elena Knyazeva | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Many methods have been proposed to speed up the prediction of structured objects (such as trees or sequences), or to take richer dependencies into account in order to improve prediction performance. These methods generally rely on approximate inference techniques and come with no theoretical guarantee, either regarding the quality of the solution found or regarding their training criterion. In this work, we study a new formulation of structured learning that views it as an incremental process in which the output is built progressively. This framework makes it possible to formalize several existing structured prediction approaches. Thanks to the connection we draw between structured learning and reinforcement learning, we are able to propose a theoretically well-justified method for learning approximate inference methods. Experiments on four NLP tasks validate the proposed approach.

A Discriminative Training Procedure for Continuous Translation Models
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Morphology-aware alignments for translation to and from a synthetic language
Franck Burlot | François Yvon
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

Sentence alignment for literary texts: The state-of-the-art and beyond
Yong Xu | Aurélien Max | François Yvon
Linguistic Issues in Language Technology, Volume 12, 2015 - Literature Lifts up Computational Linguistics

Literary works are becoming increasingly available in electronic formats, thus quickly transforming editorial processes and reading habits. In the context of the global enthusiasm for multilingualism, the rapid spread of e-book readers, such as Amazon Kindle® or Kobo Touch®, fosters the development of a new generation of reading tools for bilingual books. In particular, literary works, when available in several languages, offer an attractive perspective for self-development or everyday leisure reading, but also for activities such as language learning, translation or literary studies. An important issue in the automatic processing of multilingual e-books is the alignment between textual units. Alignment could help identify corresponding text units in different languages, which would be particularly beneficial to bilingual readers and translation professionals. Computing automatic alignments for literary works, however, is a task more challenging than in the case of better behaved corpora such as parliamentary proceedings or technical manuals. In this paper, we revisit the problem of computing high-quality alignments for literary works. We first perform a large-scale evaluation of automatic alignment for literary texts, which provides a fair assessment of the actual difficulty of this task. We then introduce a two-pass approach, based on a maximum entropy model. Experimental results for novels available in English and French or in English and Spanish demonstrate the effectiveness of our method.

Oublier ce qu’on sait, pour mieux apprendre ce qu’on ne sait pas : une étude sur les contraintes de type dans les modèles CRF
Nicolas Pécheux | Alexandre Allauzen | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

When prior knowledge about the possible outputs of a labeling problem is available, it seems desirable to include this information during training to simplify the modeling task and speed up processing. Yet even when these constraints are correct and useful at decoding time, using them during training can severely degrade performance. In this paper, we study this paradox and show that the lack of contrast induced by prior knowledge leads to a form of underfitting, which can nonetheless be limited.

LIMSI@WMT’15 : Translation Task
Benjamin Marie | Alexandre Allauzen | Franck Burlot | Quoc-Khanh Do | Julia Ive | Elena Knyazeva | Matthieu Labeau | Thomas Lavergne | Kevin Löser | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

The KIT-LIMSI Translation System for WMT 2015
Thanh-Le Ha | Quoc-Khanh Do | Eunah Cho | Jan Niehues | Alexandre Allauzen | François Yvon | Alex Waibel
Proceedings of the Tenth Workshop on Statistical Machine Translation

Apprentissage discriminant des modèles continus de traduction
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

While neural networks play an increasingly important role in natural language processing, most current training methods use criteria that are decorrelated from the application. This paper proposes a new discriminative training framework for estimating continuous translation models. This framework relies on an optimization criterion that takes into account both the metric used to evaluate translation and the integration of these models within machine translation systems. In addition, this training method is compared with existing estimation criteria, namely maximum likelihood and noise-contrastive estimation. Experiments on the English-to-French translation of TED Talks show the relevance of a discriminative training framework, whose performance nonetheless remains highly dependent on the choice of a suitable initialization strategy. We show that, with a judicious initialization, significant gains in BLEU scores can be obtained.

2014

Incremental development of statistical machine translation systems
Li Gong | Aurélien Max | François Yvon
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

Statistical Machine Translation produces results that make it a competitive option in most machine-assisted translation scenarios. However, these good results often come at a very high computational cost and correspond to training regimes which are unfit for many practical contexts, where the ability to adapt to users and domains and to continuously integrate new data (e.g. in post-edition contexts) is of primary importance. In this article, we show how these requirements can be met using a strategy for on-demand word alignment and model estimation. Most remarkably, our incremental system development framework is shown to deliver top quality translation performance even in the absence of tuning, and to surpass a strong baseline when performing online tuning. All these results are obtained with great computational savings as compared to conventional systems.

(Much) Faster Construction of SMT Phrase Tables from Large-scale Parallel Corpora (Construction (très) rapide de tables de traduction à partir de grands bi-textes) [in French]
Li Gong | Aurélien Max | François Yvon
Proceedings of TALN 2014 (Volume 3: System Demonstrations)

Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign
Marcello Federico | Sebastian Stüker | François Yvon
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

Rule-based Reordering Space in Statistical Machine Translation
Nicolas Pécheux | Alexander Allauzen | François Yvon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In Statistical Machine Translation (SMT), the constraints on word reorderings have a great impact on the set of potential translations that are explored. Notwithstanding computational issues, the reordering space of an SMT system needs to be designed with great care: if a larger search space is likely to yield better translations, it may also lead to more decoding errors, because of the added ambiguity and the interaction with the pruning strategy. In this paper, we study this trade-off using a state-of-the-art translation system, where all reorderings are represented in a word lattice prior to decoding. This allows us to directly explore and compare different reordering spaces. We study in detail a rule-based preordering system, varying the length or number of rules, the tagset used, as well as contrasting with oracle settings and purely combinatorial subsets of permutations. We focus on two language pairs: English-French, a close language pair and English-German, known to be a more challenging reordering pair.

The KIT-LIMSI Translation System for WMT 2014
Quoc Khanh Do | Teresa Herrmann | Jan Niehues | Alexander Allauzen | François Yvon | Alex Waibel
Proceedings of the Ninth Workshop on Statistical Machine Translation

LIMSI Submission for WMT’14 QE Task
Guillaume Wisniewski | Nicolas Pécheux | Alexander Allauzen | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

Combining techniques from different NN-based language models for machine translation
Jan Niehues | Alexander Allauzen | François Yvon | Alex Waibel
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

This paper presents two improvements of language models based on Restricted Boltzmann Machines (RBMs) for large machine translation tasks. In contrast to other continuous space approaches, RBM-based models can easily be integrated into the decoder and are able to directly learn a hidden representation of the n-gram. Previous work on RBM-based language models does not use a shared word representation and therefore might suffer from a lack of generalization for larger contexts. Moreover, since the training step is very time consuming, these models are only used for quite small corpora. In this work we add a shared word representation for the RBM-based language model by factorizing the weight matrix. In addition, we propose an efficient and tailored sampling algorithm that allows us to drastically speed up the training process. Experiments are carried out on two German to English translation tasks and the results show that the training time could be reduced by a factor of 10 without any drop in performance. Furthermore, the RBM-based model can also be trained on large corpora.

A Corpus of Machine Translation Errors Extracted from Translation Students Exercises
Guillaume Wisniewski | Natalie Kübler | François Yvon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present a freely available corpus of automatic translations accompanied with post-edited versions, annotated with labels identifying the different kinds of errors made by the MT system. These data have been extracted from translation students' exercises that have been corrected by a senior professor. This corpus can be useful for training quality estimation tools and for analyzing the types of errors made by the MT system.

Towards a More Efficient Development of Statistical Machine Translation Systems (Vers un développement plus efficace des systèmes de traduction statistique : un peu de vert dans un monde de BLEU) [in French]
Li Gong | Aurélien Max | François Yvon
Proceedings of TALN 2014 (Volume 2: Short Papers)

Cross-Lingual POS Tagging through Ambiguous Learning: First Experiments (Apprentissage partiellement supervisé d’un étiqueteur morpho-syntaxique par transfert cross-lingue) [in French]
Guillaume Wisniewski | Nicolas Pécheux | Elena Knyazeva | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

LIMSI English-French speech translation system
Natalia Segal | Hélène Bonneau-Maynard | Quoc Khanh Do | Alexandre Allauzen | Jean-Luc Gauvain | Lori Lamel | François Yvon
Proceedings of the 11th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper documents the systems developed by LIMSI for the IWSLT 2014 speech translation task (English→French). The main objective of this participation was twofold: adapting different components of the ASR baseline system to the peculiarities of TED talks and improving the machine translation quality on the automatic speech recognition output data. For the latter task, various techniques have been considered: punctuation and number normalization, adaptation to ASR errors, as well as the use of structured output layer neural network models for speech data.

Topic Adaptation for the Automatic Translation of News Articles (Adaptation thématique pour la traduction automatique de dépêches de presse) [in French]
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

Discriminative adaptation of continuous space translation models
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

In this paper we explore various adaptation techniques for continuous space translation models (CSTMs). We consider the following practical situation: given a large scale, state-of-the-art SMT system containing a CSTM, the task is to adapt the CSTM to a new domain using a (relatively) small in-domain parallel corpus. Our method relies on the definition of a new discriminative loss function for the CSTM that borrows from both the max-margin and pair-wise ranking approaches. In our experiments, the baseline out-of-domain SMT system is initially trained for the WMT News translation task, and the CSTM is to be adapted to the lecture translation task as defined by the IWSLT evaluation campaign. Experimental results show that an improvement of 1.5 BLEU points can be achieved with the proposed adaptation method.

LIMSI @ WMT’14 Medical Translation Task
Nicolas Pécheux | Li Gong | Quoc Khanh Do | Benjamin Marie | Yulia Ivanishcheva | Alexander Allauzen | Thomas Lavergne | Jan Niehues | Aurélien Max | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning
Guillaume Wisniewski | Nicolas Pécheux | Souhir Gahbiche-Braham | François Yvon
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Comparison of scheduling methods for the learning rate of neural network language models (Modèles de langue neuronaux: une comparaison de plusieurs stratégies d’apprentissage) [in French]
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

Traduire la parole: le cas des TED Talks [Speech translation: the TED Talks case study]
Natalia Segal | Hélène Bonneau-Maynard | François Yvon
Traitement Automatique des Langues, Volume 55, Numéro 2 : Traitement automatique du langage parlé [Spoken language processing]

2013

LIMSI Submission for the WMT‘13 Quality Estimation Task: an Experiment with N-Gram Posteriors
Anil Kumar Singh | Guillaume Wisniewski | François Yvon
Proceedings of the Eighth Workshop on Statistical Machine Translation

Design and Analysis of a Large Corpus of Post-Edited Translations: Quality Estimation, Failure Analysis and the Variability of Post-Edition
Guillaume Wisniewski | Anil Kumar Singh | Natalia Segal | François Yvon
Proceedings of Machine Translation Summit XIV: Papers

A corpus of post-edited translations (Un corpus d’erreurs de traduction) [in French]
Guillaume Wisniewski | Anil Kumar Singh | Natalia Segal | François Yvon
Proceedings of TALN 2013 (Volume 2: Short Papers)

Improving bilingual sub-sentential alignment by sampling-based transpotting
Li Gong | Aurélien Max | François Yvon
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

In this article, we present a sampling-based approach to improve bilingual sub-sentential alignment in parallel corpora. This approach can be used to align parallel sentences on an as-needed basis, and is able to accurately align newly available sentences. We evaluate the resulting alignments on several Machine Translation tasks. Results show that for the tasks considered here, our approach performs on par with the state-of-the-art statistical alignment pipeline giza++/Moses, and obtains superior results in a number of configurations, notably when aligning additional parallel sentence pairs carefully selected to match the test input.

A fully discriminative training framework for Statistical Machine Translation (Un cadre d’apprentissage intégralement discriminant pour la traduction statistique) [in French]
Thomas Lavergne | Alexandre Allauzen | François Yvon
Proceedings of TALN 2013 (Volume 1: Long Papers)

LIMSI @ WMT13
Alexander Allauzen | Nicolas Pécheux | Quoc Khanh Do | Marco Dinarelli | Thomas Lavergne | Aurélien Max | Hai-Son Le | François Yvon
Proceedings of the Eighth Workshop on Statistical Machine Translation

Traitement automatique des entités nommées en arabe: détection et traduction [Automatic processing of Arabic named entities: detection and translation]
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | François Yvon
Traitement Automatique des Langues, Volume 54, Numéro 2 : Entité Nommées [Named Entities]

2012

Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | Thomas Lavergne | François Yvon
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Arabic is a morphologically rich language, and Arabic texts abound in complex word forms built by concatenation of multiple subparts, corresponding for instance to prepositions, articles, roots, prefixes, or suffixes. The development of Arabic Natural Language Processing applications, such as Machine Translation (MT) tools, thus requires some kind of morphological analysis. In this paper, we compare various strategies for performing such preprocessing, using generic machine learning techniques. The resulting tool is compared with two open domain alternatives in the context of a statistical MT task and is shown to be faster than its competitors, with no significant difference in MT quality.

Measuring the Influence of Long Range Dependencies with Neural Network Language Models
Hai Son Le | Alexandre Allauzen | François Yvon
Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT

Towards contextual adaptation for any-text translation
Li Gong | Aurélien Max | François Yvon
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers

Adaptation for Machine Translation has been studied in a variety of ways, using an ideal scenario where the training data can be split into "out-of-domain" and "in-domain" corpora, on which the adaptation is based. In this paper, we consider a more realistic setting that does not assume the availability of any kind of "in-domain" data, hence the name "any-text translation". In this context, we present a new approach to contextually adapt a translation model on-the-fly, and present several experimental results where this approach outperforms conventionally trained baselines. We also present a document-level contrastive evaluation whose results can be easily interpreted, even by non-specialists.

Computing Lattice BLEU Oracle Scores for Machine Translation
Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Non-linear n-best List Reranking with Few Features
Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In Machine Translation, it is customary to compute the model score of a predicted hypothesis as a linear combination of multiple features, where each feature assesses a particular facet of the hypothesis. The choice of a linear combination is usually justified by the possibility of efficient inference (decoding); yet, the appropriateness of this simple combination scheme to the task at hand is rarely questioned. In this paper, we propose an approach that replaces the linear scoring function with a non-linear scoring function. To investigate the applicability of this approach, we rescore n-best lists generated with a conventional machine translation engine (using a linear scoring function for generating its hypotheses) with a non-linear scoring function learned using the learning-to-rank framework. Moderate, though consistent, gains in BLEU are demonstrated on the WMT’10, WMT’11 and WMT’12 test sets.

WSD for n-best reranking and local language modeling in SMT
Marianna Apidianaki | Guillaume Wisniewski | Artem Sokolov | Aurélien Max | François Yvon
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Préface [Foreword]
Robert Dale | François Yvon
Traitement Automatique des Langues, Volume 53, Numéro 3 : Du bruit dans le signal : gestion des erreurs en traitement automatique des langues [Managing noise in the signal: Error handling in natural language processing]

Repérage des entités nommées pour l’arabe : adaptation non-supervisée et combinaison de systèmes (Named Entity Recognition for Arabic : Unsupervised adaptation and Systems combination) [in French]
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | Thomas Lavergne | François Yvon
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Hierarchical Sub-sentential Alignment with Anymalign
Adrien Lardilleux | François Yvon | Yves Lepage
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

Aligning Bilingual Literary Works: a Pilot Study
Qian Yu | Aurélien Max | François Yvon
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

Non-Linear Models for Confidence Estimation
Yong Zhuang | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

LIMSI @ WMT12
Hai-Son Le | Thomas Lavergne | Alexandre Allauzen | Marianna Apidianaki | Li Gong | Aurélien Max | Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

Alignement sous-phrastique hiérarchique avec Anymalign (Hierarchical Sub-Sentential Alignment with Anymalign) [in French]
Adrien Lardilleux | François Yvon | Yves Lepage
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Continuous Space Translation Models with Neural Networks
Hai Son Le | Alexandre Allauzen | François Yvon
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

Measuring the Confusability of Pronunciations in Speech Recognition
Panagiota Karanasou | François Yvon | Lori Lamel
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing

LIMSI @ WMT11
Alexandre Allauzen | Hélène Bonneau-Maynard | Hai-Son Le | Aurélien Max | Guillaume Wisniewski | François Yvon | Gilles Adda | Josep Maria Crego | Adrien Lardilleux | Thomas Lavergne | Artem Sokolov
Proceedings of the Sixth Workshop on Statistical Machine Translation

Two Ways to Use a Noisy Parallel News Corpus for Improving Statistical Machine Translation
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | François Yvon
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Généralisation de l’alignement sous-phrastique par échantillonnage (Generalization of sub-sentential alignment by sampling)
Adrien Lardilleux | François Yvon | Yves Lepage
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Sub-sentential alignment consists in extracting translations of textual units smaller than the sentence from parallel multilingual texts aligned at the sentence level. Such an alignment is needed, for example, to train statistical machine translation systems. The standard approach to this task involves successively estimating several probabilistic models of increasing complexity and using heuristics that align isolated words and then, by extension, groups of words. In this article, we consider an alternative approach, initially proposed in (Lardilleux & Lepage, 2008), which relies on a much simpler principle: comparing occurrence profiles in subcorpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of long translation units, and we evaluate these improvements on machine translation tasks.

The Quaero program is an international project promoting research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Within the program framework, research organizations and industrial partners collaborate to develop prototypes of innovating applications and services for access and usage of multimedia data. One of the topics addressed is the translation of spoken language. Each year, a project-internal evaluation is conducted by DGA to monitor the technological advances. This work describes the design and results of the 2011 evaluation campaign. The participating partners were RWTH, KIT, LIMSI and SYSTRAN. Their approaches are compared on both ASR output and reference transcripts of speech data for the translation between French and German. The results show that the developed techniques further the state of the art and improve translation quality.

LIMSI’s experiments in domain adaptation for IWSLT11
Thomas Lavergne | Alexandre Allauzen | Hai-Son Le | François Yvon
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

LIMSI took part in the IWSLT 2011 TED task in the MT track for English to French using the in-house n-code system, which implements the n-gram based approach to Machine Translation. This framework not only achieves state-of-the-art results for this language pair, but is also appealing due to its conceptual simplicity and its use of well-understood statistical language models. Using this approach, we compare several ways to adapt our existing systems and resources to the TED task with a mixture of language models, and try to provide an analysis of the modest gains obtained by training a log-linear combination of in- and out-of-domain models.

Estimation d’un modèle de traduction à partir d’alignements mot-à-mot non-déterministes (Estimating a translation model from non-deterministic word-to-word alignments)
Nadi Tomeh | Alexandre Allauzen | François Yvon
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In phrase-based statistical machine translation systems, the translation model is estimated from word-to-word alignments using extraction and scoring heuristics. Although these word alignments are built by probabilistic models, the extraction and scoring procedures use these models under the assumption that the alignments are deterministic. In this article, we propose to lift this assumption by considering the entire alignment matrix of a sentence pair, where each association is weighted by its probability. In comparison with previous work, we show that using an exponential model to estimate these probabilities discriminatively yields significant improvements in translation performance. These improvements are measured with the BLEU metric on the Arabic-to-English translation task of the NIST MT’09 evaluation, under two conditions that differ in the size of the parallel corpus used.

Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]
Éric Villemonte de La Clergerie | Béatrice Daille | Yves Lepage | François Yvon
Traitement Automatique des Langues, Volume 52, Numéro 1 : Varia [Varia]

Discriminative Weighted Alignment Matrices For Statistical Machine Translation
Nadi Tomeh | Alexandre Allauzen | François Yvon
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

From n-gram-based to CRF-based Translation Models
Thomas Lavergne | Alexandre Allauzen | Josep Maria Crego | François Yvon
Proceedings of the Sixth Workshop on Statistical Machine Translation

How good are your phrases? Assessing phrase quality with single class classification
Nadi Tomeh | Marco Turchi | Guillaume Wisinewski | Alexandre Allauzen | François Yvon
Proceedings of the 8th International Workshop on Spoken Language Translation: Papers

We present a novel translation-quality-informed procedure for both extraction and scoring of phrase pairs in PBSMT systems. We reformulate the extraction problem in the supervised learning framework. Our goal is twofold: first, we attempt to take translation quality into account; second, we incorporate arbitrary features in order to circumvent alignment errors. One-Class SVMs and the Mapping Convergence algorithm permit training a single-class classifier to discriminate between useful and useless phrase pairs. Such a classifier can be learned from a training corpus that comprises only useful instances. The confidence score produced by the classifier for each phrase pair is employed as a selection criterion. The smoothness of these scores allows fine control over the size of the resulting translation model. Finally, confidence scores provide a new accuracy-based feature to score phrase pairs. Experimental evaluation of the method shows accurate assessments of phrase pair quality, even for regions in the space of possible phrase pairs that are ignored by other approaches. This enhanced evaluation of phrase pairs leads to improvements in translation performance as measured by BLEU.

Minimum Error Rate Training Semiring
Artem Sokolov | François Yvon
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2010

Recueil et analyse d’un corpus écologique de corrections orthographiques extrait des révisions de Wikipédia
Guillaume Wisniewski | Aurélien Max | François Yvon
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we introduce a rule-based method for automatically extracting spelling corrections from the revision history of the collaborative encyclopedia Wikipedia. This method allowed us to build an error corpus containing 72,483 lexical errors (non-word errors) and 74,100 grammatical errors (real-word errors). To our knowledge, no larger corpus of naturally occurring errors is freely available. Moreover, the techniques used can easily be transposed to many other languages. Collecting this corpus opens new perspectives for studying frequent errors as well as for training and evaluating automatic spelling correctors. Several experiments illustrating its usefulness are presented.

Practical Very Large Scale CRFs
Thomas Lavergne | Olivier Cappé | François Yvon
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Local lexical adaptation in Machine Translation through triangulation: SMT helping SMT
Josep Maria Crego | Aurélien Max | François Yvon
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Proceedings of the 14th Annual Conference of the European Association for Machine Translation
François Yvon | Viggo Hansen
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

LIMSI’s Statistical Translation Systems for WMT’10
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | François Yvon
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Contrastive Lexical Evaluation of Machine Translation
Aurélien Max | Josep Maria Crego | François Yvon
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper advocates a complementary measure of translation performance that focuses on the contrastive ability of two or more systems or system versions to adequately translate source words. This is motivated by three main reasons: 1) existing automatic metrics sometimes do not show significant differences that can be revealed by fine-grained, focused human evaluation; 2) these metrics are based on direct comparisons between system hypotheses and the corresponding reference translations, thus ignoring the input words that were actually translated; and 3) as these metrics do not take input hypotheses from several systems at once, fine-grained contrastive evaluation can only be done indirectly. This proposal is illustrated on a multi-source Machine Translation scenario where multiple translations of a source text are available. Significant gains (up to +1.3 BLEU points) are achieved in these experiments, and contrastive lexical evaluation is shown to provide new information that can help to better analyse a system's performance.

LIMSI @ IWSLT 2010
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | Le Hai-Son | Guillaume Wisniewski | François Yvon
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes LIMSI’s Statistical Machine Translation (SMT) systems for the IWSLT evaluation, where we participated in two tasks (Talk for English to French and BTEC for Turkish to English). For the Talk task, we studied an extension of our in-house n-code SMT system (the integration of a bilingual reordering model over generalized translation units), as well as the use of training data extracted from Wikipedia in order to adapt the target language model. For the BTEC task, we concentrated on pre-processing schemes on the Turkish side in order to reduce the morphological discrepancies with the English side. We also evaluated the use of two different continuous space language models for such a small amount of training data.

Assessing Phrase-Based Translation Models with Oracle Decoding
Guillaume Wisniewski | Alexandre Allauzen | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Improving Reordering with Linguistically Informed Bilingual n-grams
Josep Maria Crego | François Yvon
Coling 2010: Posters

Training Continuous Space Language Models: Some Practical Issues
Hai Son Le | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Refining Word Alignment with Discriminative Training
Nadi Tomeh | Alexandre Allauzen | François Yvon | Guillaume Wisniewski
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

The quality of statistical machine translation systems depends on the quality of the word alignments that are computed during the translation model training phase. IBM alignment models, as implemented in the GIZA++ toolkit, constitute the de facto standard for performing these computations. The resulting alignments and translation models are however very noisy, and several authors have tried to improve them. In this work, we propose a simple and effective approach, which considers alignment as a series of independent binary classification problems in the alignment matrix. Through extensive feature engineering and the use of stacking techniques, we were able to obtain alignments much closer to manually defined references than those obtained by the IBM models. These alignments also yield better translation models, delivering improved performance in a large scale Arabic to English translation task.

The pay-offs of preprocessing for German-English statistical machine translation
Ilknur Durgar El-Kahlout | Francois Yvon
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

In this paper, we present the results of our work on improving the preprocessing for German-English statistical machine translation. We implemented and tested various improvements aimed at i) converting German texts to the new orthographic conventions; ii) performing a new tokenization for German; iii) normalizing lexical redundancy with the help of POS tagging and morphological analysis; iv) splitting German compound words with a frequency-based algorithm; and v) reducing singletons and out-of-vocabulary (OOV) words. All these steps are performed during preprocessing on the German side. Combining all these processes, we reduced singletons by 10% and OOV words by 2%, and obtained a 1.5 absolute (7% relative) BLEU improvement on the WMT 2010 German to English News translation task.

Micro-adaptation lexicale en traduction automatique statistique [Lexical Micro-adaptation in Statistical Machine Translation]
Josep Maria Crego | Gregor Leusch | Aurélien Max | Hermann Ney | François Yvon
Traitement Automatique des Langues, Volume 51, Numéro 2 : Multilinguisme et traitement automatique des langues [Multilingualism and Natural Language Processing]

2009

Gappy Translation Units under Left-to-Right SMT Decoding
Josep M. Crego | François Yvon
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Improvements in Analogical Learning: Application to Translating Multi-Terms of the Medical Domain
Philippe Langlais | François Yvon | Pierre Zweigenbaum
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

LIMSI‘s Statistical Translation Systems for WMT‘09
Alexandre Allauzen | Josep Crego | Aurélien Max | François Yvon
Proceedings of the Fourth Workshop on Statistical Machine Translation

Sélection de caractéristiques pour les champs aléatoires conditionnels par pénalisation L1 [Selecting features with L1 regularization in Conditional Random Fields]
Nataliya Sokolovska | Olivier Cappé | François Yvon
Traitement Automatique des Langues, Volume 50, Numéro 3 : Apprentissage automatique pour le TAL [Machine Learning for NLP]

Plusieurs langues (bien choisies) valent mieux qu’une : traduction statistique multi-source par renforcement lexical
Josep Maria Crego | Aurélien Max | François Yvon
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Statistical translation systems integrate different types of models whose predictions are combined during decoding to produce the best possible translations. Correctly translating polysemous words, such as the French word avocat into English (lawyer or avocado), requires additional models whose estimation and integration prove complex. An alternative is to exploit the observation that the ambiguities caused by polysemy differ depending on the source language considered. If, for example, a Spanish translation is available in which avocat has been translated as aguacate, then the translation of this word into English is no longer ambiguous. Thus, knowledge of a French→Spanish translation reinforces the selection of the translation avocado in the French→English system. In this article, we propose to use documents in several languages to reinforce the lexical choices made by a machine translation system. In particular, we show improved performance on several metrics when the auxiliary translations used are produced manually.

2008

Robust Similarity Measures for Named Entities Matching
Erwan Moreau | François Yvon | Olivier Cappé
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Appariement d’entités nommées coréférentes : combinaisons de mesures de similarité par apprentissage supervisé
Erwan Moreau | François Yvon | Olivier Cappé
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Named entity matching consists in grouping together the different forms under which an entity appears. Textual similarity measures are generally used for this purpose. We propose to combine several measures in order to improve performance on the matching task. Through experiments conducted on two corpora, we show the relevance of supervised learning for this purpose, particularly with the C4.5 algorithm.

Transcrire les SMS comme on reconnaît la parole
Catherine Kobus | François Yvon | Géraldine Damnati
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents an architecture inspired by speech recognition systems for performing spelling normalization of messages written in "SMS language". We describe our baseline system, as well as various extensions of this system that substantially improve the quality of the normalizations produced.

Scaling up Analogical Learning
Philippe Langlais | François Yvon
Coling 2008: Companion volume: Posters

Limsi’s Statistical Translation Systems for WMT‘08
Daniel Déchelotte | Gilles Adda | Alexandre Allauzen | Hélène Bonneau-Maynard | Olivier Galibert | Jean-Luc Gauvain | Philippe Langlais | François Yvon
Proceedings of the Third Workshop on Statistical Machine Translation

Using LDA to detect semantically incoherent documents
Hemant Misra | Olivier Cappé | François Yvon
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

Normalizing SMS: are Two Metaphors Better than One ?
Catherine Kobus | François Yvon | Géraldine Damnati
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2006

Du quatrième de proportion comme principe inductif : une proposition et son application à l’apprentissage de la morphologie [Inference with formal analogical proportions: application to the automatic learning of morphology]
Nicolas Stroppa | François Yvon
Traitement Automatique des Langues, Volume 47, Numéro 1 : Varia [Varia]

Productivité quantitative des suffixations par -ité et -Able dans un corpus journalistique moderne
Natalia Grabar | Delphine Tribout | Georgette Dal | Bernard Fradin | Nabil Hathout | Stéphanie Lignon | Fiammetta Namer | Clément Plancq | François Yvon | Pierre Zweigenbaum
Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this work, we carry out a corpus study of the quantitative productivity of the French suffixations in -Able and -ité, first independently of each other, then when they are chained derivationally (suffixation in -ité applies to bases in -Able in about 15% of cases). We estimate the productivity of these suffixations using statistical measures whose evolution we track as a function of corpus size. Both suffixations are productive in modern French: they form new lexemes throughout the corpora studied with no observable saturation, and their productivity indices evolve stably, although this evolution depends on the calculations applied to them. We note, however, that overall, of the two suffixations, -ité suffixation is the more frequent in journalistic corpora, except precisely when -ité applies to an adjective in -Able. Given that an adjective in -Able and the corresponding noun in -ité express the same property, this result indicates that the complexity of the base is a parameter to be taken into account in the formation of the possible lexicon.

2005

An Analogical Learner for Morphological Analysis
Nicolas Stroppa | François Yvon
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

2004

Analogies dans les séquences : un solveur à états finis
Nicolas Stroppa | François Yvon
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

Analogical learning is based on an inferential principle that is potentially relevant for natural language processing. Using this principle for linguistic analysis tasks nevertheless presupposes a formal definition of analogy between sequences. In this article, we propose such a definition and show that it leads to an efficient implementation of an analogical equation solver as a finite-state transducer. Equipped with these results, we empirically characterize the analogical extension of several finite languages, corresponding to dictionaries of four languages.

2003

Apprentissage Automatique de Paraphrases pour l’Amélioration d’un Système de Questions-Réponses
Florence Duclaye | Olivier Collin | François Yvon
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we present a weakly supervised learning methodology for automatically extracting paraphrases from the Web. Starting from a single example of a (predicate, arguments) pair, a corpus is progressively accumulated by querying the Web. Querying phases alternate with filtering phases, during which the least plausible paraphrases are eliminated using an unsupervised clustering procedure. This learning mechanism builds on an existing question-answering system, and the learned paraphrases will be used to improve its recall. We focus here on the learning mechanism of this system and present its first results.

2002

Using the Web as a Linguistic Resource for Learning Reformulations Automatically
Florence Duclaye | François Yvon | Olivier Collin
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

A French Phonetic Lexicon with Variants for Speech and Language Processing
Philippe Boula de Mareüil | Christophe d’Alessandro | François Yvon | Véronique Aubergé | Jacqueline Vaissière | Angélique Amelot
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1997

Paradigmatic Cascades: A Linguistically Sound Model of Pronunciation by Analogy
François Yvon
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

Co-authors

Franck Burlot 14

Hinrich Schütze 14

Lauriane Aufrant 12

Hélène Bonneau-Maynard 11

Quoc Khanh Do 11

Amir Hossein Kargaran 9

Minh Quang Pham 9

Nicolas Pécheux 9

Elena Knyazeva 8

Rachel Bawden 7

Artem Sokolov 7

Alexander Allauzen 6

Ondřej Bojar 6

Souhir Gahbiche-Braham 6

Nicolas Ballier 5

Olivier Cappé 5

Masoud Jalili Sabet 5

Ophélie Lacroix 5

Jean Senellart 5

Marianna Apidianaki 4

Laurent Besacier 4

Maxime Bouthors 4

François Buet 4

Matthieu Dubois 4

Natalie Kübler 4

Matthieu Labeau 4

Adrien Lardilleux 4

Pablo Piantanida 4

Natalia Segal 4

Sadaf Abdul-Rauf 3

Frédéric Blain 3

Ilknur Durgar El-Kahlout 3

Pierre Godard 3

Anh Khoa Ngo Ho 3

Mathilde Huguin 3

Dávid Javorský 3

Philippe Langlais 3

Alexandra Mestivier 3

Laurent Romary 3

Anil Kumar Singh 3

Nicolas Stroppa 3

Sebastian Stüker 3

Éric Villemonte de la Clergerie 3

Mingyang Wang 3

Martine Adda-Decker 2

Olivier Collin 2

Géraldine Damnati 2

Hal Daumé III 2

Florence Duclaye 2

Philipp Dufter 2

Marcello Federico 2

Jean-Luc Gauvain 2

Catherine Kobus 2

Benjamin Marie 2

André F. T. Martins 2

Jean-François Nominé 2

Jan-Thorsten Peter 2

Clément Pillias 2

Mārcis Pinnis 2

Annie Rialland 2

Lütfi Kerem Senel 2

Silvia Severini 2

Pierre Zweigenbaum 2

Tamer Alkhouli 1

Angélique Amelot 1

Dimitra Anastasiou 1

Véronique Aubergé 1

Albina Auksoriūtė 1

Gerhard Backfried 1

Loic Barrault 1

Éléonor Bartenlian 1

Jasmijn Bastings 1

Juan Benjumea 1

Luisa Bentivogli 1

Marc Benzahra 1

Elise Bertin-Lemée 1

Kalina Bontcheva 1

Karim Boudahmane 1

Fethi Bougares 1

Philippe Boula de Mareüil 1

António Branco 1

Fabienne Braune 1

Gerhard Budin 1

Bianka Buschbeck 1

Marine Carpuat 1

Laurène Cave 1

Mauro Cettolo 1

Runsheng Chen 1

Monojit Choudhury 1

Khalid Choukri 1

Éric Clergerie 1

Jamison Cooper-Leavitt 1

Pierre Cubaud 1

José Cornejo Cárcamo 1

Walter Daelemans 1

Nicolas Dahan 1

Béatrice Daille 1

Georgette Dal 1

Koenraad De Smedt 1

Manon Delorme 1

Nicolas Devatine 1

Marco Dinarelli 1

Daniel Déchelotte 1

Raphaël Esamotunu 1

Bernard Fradin 1

Alexander Fraser 1

Markus Freitag 1

Olivier Galibert 1

Ge Gao (高歌) 1

Radovan Garabík 1

Mercedes García-Martínez 1

Patrick Gatellier 1

Maria Gavriilidou 1

Natalia Grabar 1

Alvin Grissom II 1

Dagmar Gromann 1

Stig-Arne Grönroos 1

José Manuel Gómez-Pérez 1

Ahmad Dawar Hakimi 1

Thierry Hamon 1

Nabil Hathout 1

Stefanie Hegele 1

Teresa Herrmann 1

Lea Hirlimann 1

Matthias Huck 1

Gabriel Illouz 1

Ayyoob ImaniGooghari 1

Morten Irgens 1

Yulia Ivanishcheva 1

Alina Karakanta 1

Panagiota Karanasou 1

Marzena Karpinska 1

Elaine C. Khoong 1

Maarit Koponen 1

Guy-Noel Kouarata 1

Cvetana Krstev 1

Joachim Köhler 1

Felicia Körner 1

Margot Lacour 1

Laure Le Bars 1

Gaël Lejeune 1

Pierre-Antoine Lequeu 1

Gregor Leusch 1

William Lewis 1

Stéphanie Lignon 1

Anne-Laure Ligozat 1

Krister Lindén 1

Paul Lukowicz 1

Andrea Lösch 1

Bernardo Magnini 1

Katrin Marheinecke 1

Joseph Mariani 1

Mona Michelot 1

Joachim Minder 1

Ali Modarressi 1

Markus Mueller 1

Satoshi Nakamura 1

Fiammetta Namer 1

Tommi Nieminen 1

Nafiseh Nikeghbal 1

Mary Nurminen 1

Douglas W. Oard 1

Maciej Ogrodniczuk 1

Etienne Ollion 1

Bolette Sandford Pedersen 1

Stephan Peitz 1

Ngoc-Quan Pham 1

Cubaud Pierre 1

Vijini Pilana Liyanage 1

Stelios Piperidis 1

Benjamin Piwowarski 1

Clément Plancq 1

Barbara Plank 1

Maja Popović 1

Christoph Prinz 1

Joanna Radoła 1

Vinit Ravishankar 1

José Carlos Rosales Núñez 1

Michael Rosner 1

Eirikur Rögnvaldsson 1

Pooyan Safari 1

Yves Scherrer 1

Helmut Schmid 1

Rico Sennrich 1

Nazanin Shafiabadi 1

Michel Simard 1

Inguna Skadiņa 1

Philipp Slusallek 1

Nataliya Sokolovska 1

Aleš Tamchyna 1

Delphine Tribout 1

Panagiotis Tsolakis 1

Jacqueline Vaissiere 1

Andrejs Vasiļjevs 1

Tamás Váradi 1

Tonio Wandmacher 1

Philip Williams 1

Guillaume Wisinewski 1

Joern Wuebker 1

Orgest Xhelili 1

Marcely Zanon Boito 1

Christophe d’Alessandro 1

Josef van Genabith 1

Valters Šics 1

Venues